Susang Kim(healess1@gmail.com)
Video Understanding(1)
Quo Vadis, Action Recognition?
A New Model and the Kinetics Dataset
Action Recognition paper
A paper from DeepMind (CVPR 2017) that proposes
the Two-Stream Inflated 3D ConvNets (I3D)
architecture for Action Recognition and releases
the Kinetics Dataset.
Action Recognition: classifying what action a
person performs in a given video (the model takes
a video as input and outputs a prediction).
A scene from Quo Vadis:
Are these actors about to kiss each other,
or have they just done so?
⇒ Actions can be ambiguous in individual
frames
Kinetics Dataset - Human Action Video Dataset (released with this paper)
Motivated by the observation that models pre-trained on ImageNet (~1,000 images for each of 1,000
categories) perform well not only on classification but also on object detection/segmentation, the
authors build a comparable large-scale dataset for Action Recognition. Pre-training on Kinetics and
then fine-tuning on the existing HMDB-51 and UCF-101 benchmarks achieves SOTA, demonstrating
how important a large amount of training data is.
Kinetics Dataset: centered on human actions (single-person actions, person-person interactions,
and person-object interactions), made up of trimmed video clips of about 10 seconds each with
hundreds of clips per class; 400 classes were released first and later extended to 600 classes.
In this paper the miniKinetics dataset (a preliminary subset of full Kinetics for quick experiments)
is used for fast video experiments: 213 classes and about 120k clips, with 150-1,000 clips per class
(validation: 25 clips / test: 75 clips per class).
A Short Note about Kinetics-600 https://arxiv.org/pdf/1808.01340.pdf
Action Recognition Benchmark
DATASET YEAR # ACTIONS # CLIPS PER ACTION
KTH 2004 6 10
Weizmann 2005 9 9
IXMAS 2006 11 33
Hollywood 2008 8 30-140
UCF Sports 2009 9 14-35
Hollywood2 2009 12 61-278
UCF YouTube 2009 11 100
MSR 2009 3 14-25
Olympic 2010 16 50
UCF50 2010 50 min. 100
HMDB51 2011 51 min. 101
First pretraining on Kinetics and then fine-tuning on HMDB-51 and UCF-101 gives a boost in performance.
HMDB-51 Dataset
Released at ICCV 2011: a human-motion dataset of 6,849 video clips annotated with 51 action
categories; each category contains at least 101 clips.
http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#introduction
UCF101 Dataset
UCF101 Dataset: released in 2012, containing 13,320 videos over 101 action classes.
The videos of each class are divided into 25 groups, with 4-7 videos of the action per group.
https://www.crcv.ucf.edu/data/UCF101.php
Video Architecture
Starting from an ImageNet pre-trained model (Inception-v1), the paper compares existing
architectures (a-d) with the proposed I3D (e) plus pre-training on Kinetics, showing the performance
gains obtained by changing the network structure.
The Old I: ConvNet+LSTM
Long-term Recurrent Convolutional Networks for Visual Recognition
and Description(CVPR 2015) (https://arxiv.org/pdf/1411.4389.pdf)
Frames are sampled at 25 fps, features are extracted from each frame with a CNN (Inception-V1),
and an LSTM with batch norm and 512 hidden units models the temporal sequence,
trained with a cross-entropy loss.
Training is costly because the LSTM must be unrolled sequentially over the frames.
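A minimal sketch of this ConvNet+LSTM baseline, assuming a recent PyTorch/torchvision setup with
GoogLeNet (Inception-V1) as the per-frame feature extractor; the clip length and class count are
placeholders, not values from the paper.

```python
# Sketch (not the paper's code): per-frame 2D CNN features followed by an LSTM classifier.
import torch
import torch.nn as nn
import torchvision.models as models

class ConvNetLSTM(nn.Module):
    def __init__(self, num_classes=400, hidden=512):
        super().__init__()
        backbone = models.googlenet(weights="IMAGENET1K_V1")  # Inception-V1 stand-in
        backbone.aux_logits = False                           # ignore auxiliary heads
        backbone.fc = nn.Identity()                           # keep the 1024-d pooled features
        self.backbone = backbone
        self.lstm = nn.LSTM(input_size=1024, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, video):                                 # video: (B, T, 3, 224, 224)
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1)).view(b, t, -1)  # per-frame features
        out, _ = self.lstm(feats)                             # temporal modelling
        return self.classifier(out[:, -1])                    # classify from the last step

logits = ConvNetLSTM()(torch.randn(2, 8, 3, 224, 224))        # -> shape (2, 400)
```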
The Old II: 3D ConvNets
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri.
Learning spatiotemporal features with 3d convolutional networks (ICCV 2015)
https://arxiv.org/pdf/1412.0767.pdf
The 3D CNN defined here as C3D has many more parameters than a 2D network because of the extra
kernel dimension, which makes it harder to train; training used batches of 15 videos on K40 GPUs.
H x W x D => T x H x W x D (a time axis is added, so the kernel also slides forward and backward in time)
Since optimized 2D networks such as ResNet were not yet available at the time, a newly defined
network (shown below) was trained from scratch.
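For illustration only (not from the paper), the sketch below contrasts a 2D convolution over a single
frame with a 3D convolution over a clip; channel counts and clip length are arbitrary, but the
parameter comparison shows why 3D ConvNets are harder to train.

```python
# A 2D convolution sees one frame (C, H, W); a 3D convolution also slides along time over (C, T, H, W).
import torch
import torch.nn as nn

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # spatial only
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)   # temporal + spatial

frame = torch.randn(1, 3, 112, 112)        # a single RGB frame
clip = torch.randn(1, 3, 16, 112, 112)     # a 16-frame RGB clip (C3D-style input)

print(conv2d(frame).shape)                 # (1, 64, 112, 112)
print(conv3d(clip).shape)                  # (1, 64, 16, 112, 112)
print(sum(p.numel() for p in conv2d.parameters()),   # 1,792 parameters
      sum(p.numel() for p in conv3d.parameters()))   # 5,248 parameters (3x more per layer)
```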
The Old III: Two-Stream Networks
Optical flow, an efficient approach to motion estimation, is adopted: a two-stream design with an RGB (spatial) stream and an Optical Flow (temporal) stream.
Two-Stream Convolutional Networks for Action Recognition in Videos (NIPS 2014)
(https://papers.nips.cc/paper/2014/file/00ec53c4682d36f5c4359f4ae7bd7ba1-Paper.pdf)
The Old IV: 3D-Fused Two-Stream
Convolutional two-stream network fusion for video action recognition (CVPR 2016)
https://arxiv.org/pdf/1604.06573.pdf
This approach improved performance on HMDB-51 (Human Motion DataBase): features from multiple
RGB frames and from optical flow are extracted as in the two-stream setup and then fused with a 3D
conv (the 3D conv fuses the already-extracted features rather than extracting features from raw frames).
The network is Inception-V1; 5 consecutive RGB frames sampled 10 frames apart, together with the
corresponding optical-flow snippets, are trained end-to-end.
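A minimal sketch of this fusion idea, assuming the two streams' last convolutional feature maps are
already available; channel sizes and the pooling/classifier layout are placeholders rather than the
CVPR 2016 architecture.

```python
# Sketch: 2D features from an RGB stream and a flow stream, stacked over time and fused by a 3D conv.
import torch
import torch.nn as nn

class FusedTwoStream(nn.Module):
    def __init__(self, feat_ch=832, num_classes=400):
        super().__init__()
        self.fuse = nn.Conv3d(2 * feat_ch, 512, kernel_size=3, padding=1)
        self.head = nn.Linear(512, num_classes)

    def forward(self, rgb_feats, flow_feats):
        # each input: (B, C, T, H, W) feature maps from the last 2D conv layers
        x = torch.relu(self.fuse(torch.cat([rgb_feats, flow_feats], dim=1)))
        x = x.mean(dim=(2, 3, 4))              # global average pool over T, H, W
        return self.head(x)

scores = FusedTwoStream()(torch.randn(1, 832, 5, 7, 7), torch.randn(1, 832, 5, 7, 7))
```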
The New: Two-Stream Inflated 3D ConvNets(I3D)
The approach proposed in this paper: instead of extracting features frame by frame with 2D convs,
multiple RGB frames and optical-flow frames are processed at once with 3D convs and then combined,
which is claimed to capture both the continuous motion in RGB and the changes in optical flow more
accurately. (Rather than designing a new 3D network, the proven Inception-v1 is inflated into 3D.)
2D->3D (N × N filters become N × N × N: an extra time dimension is added)
Because the 3D convs take multiple frames and optical flow as input, the ImageNet pre-trained 2D
weights are reused by copying them N times along the new time dimension, so the existing weights
can still be exploited.
repeating the weights of the 2D filters N times along the time dimension, and rescaling them by dividing by N
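A minimal sketch of this inflation, assuming PyTorch's (out_ch, in_ch, kH, kW) weight layout; it
illustrates the bootstrapping rule above and is not the DeepMind code.

```python
# Inflate an ImageNet-pretrained 2D kernel into 3D: repeat the weights N times along the new time
# axis and divide by N, so a temporally constant ("boring") video gives the same activations as the
# original 2D filter did on a single image.
import torch

def inflate_conv2d_weight(w2d: torch.Tensor, time_kernel: int) -> torch.Tensor:
    """(out_ch, in_ch, kH, kW) -> (out_ch, in_ch, kT, kH, kW)"""
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)   # copy along time
    return w3d / time_kernel                                  # preserve activations

w2d = torch.randn(64, 3, 7, 7)          # e.g. a stem filter of Inception-v1
w3d = inflate_conv2d_weight(w2d, 7)     # N x N filters become N x N x N
print(w3d.shape)                         # torch.Size([64, 3, 7, 7, 7])
```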
The model is proposed under the name Two-Stream Inflated 3D ConvNet (I3D).
One I3D network is trained on RGB inputs, and another on flow inputs, which carry optimized, smooth
flow information. The two networks are trained separately and their predictions averaged at test time.
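A sketch of that test-time fusion, assuming two already-trained models and pre-computed RGB/flow
clips; averaging softmax scores rather than logits is an implementation choice here, not a detail
stated above.

```python
# Average the predictions of an RGB I3D and a flow I3D at test time (illustrative sketch).
import torch

def two_stream_predict(i3d_rgb, i3d_flow, rgb_clip, flow_clip):
    """rgb_clip: (B, 3, T, H, W), flow_clip: (B, 2, T, H, W) -> class probabilities"""
    with torch.no_grad():
        probs_rgb = i3d_rgb(rgb_clip).softmax(dim=-1)
        probs_flow = i3d_flow(flow_clip).softmax(dim=-1)
    return (probs_rgb + probs_flow) / 2
```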
Implementation Details
The Inception V1 network is modified for better performance. Because spatial and temporal information
are learned together, choosing the temporal strides/pooling is important: if the temporal receptive field
grows too fast, spatial detail from different moments gets conflated; if it grows too slowly, the motion
(scene dynamics) is missed. (The first two max-pooling layers therefore do not pool over the time axis.)
RGB alone can already infer motion from how frame features change, but optical flow lets the model
capture finer motion.
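A small sketch of that pooling choice, with illustrative tensor sizes: early pooling keeps the time
dimension intact while downsampling space, and only later pooling reduces time as well.

```python
# Early max-pooling with kernel/stride 1 along T preserves temporal resolution; later pooling halves it.
import torch
import torch.nn as nn

early_pool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))
later_pool = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

x = torch.randn(1, 64, 16, 112, 112)        # (B, C, T, H, W)
print(early_pool(x).shape)                   # time preserved: (1, 64, 16, 56, 56)
print(later_pool(x).shape)                   # time halved:    (1, 64, 8, 56, 56)
```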
Implementation Details
All models but the C3D-like 3D ConvNet use ImageNet pretrained Inception-V1 as base network.
For all architectures we follow each convolutional layer by a batch normalization layer and a ReLU
activation function, except for the last convolutional layers which produce the class scores(1x1x1) for
each network.
Training on videos used standard SGD with momentum set to 0.9 in all cases, with synchronous
parallelization across 32 GPUs for all models except the 3D ConvNets(64 GPUs).
We trained models on miniKinetics for up to 35k steps, and for 110k steps on Kinetics, with a 10%
reduction of learning rate when validation loss saturated.
We tuned the learning rate hyperparameter on the validation set of miniKinetics. Models were trained for
up to 5k steps on UCF-101 and HMDB-51 using a similar learning rate adaptation procedure as for
Kinetics but using just 16 GPUs. All the models were implemented in TensorFlow
(https://github.com/deepmind/kinetics-i3d)
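As a sketch of this optimization setup (written in PyTorch rather than the TensorFlow used by the
authors): SGD with momentum 0.9 and a plateau-based learning-rate reduction; the base learning rate,
reduction factor, and model are placeholders.

```python
# SGD with momentum 0.9, lowering the learning rate when the validation loss saturates.
import torch

model = torch.nn.Linear(1024, 400)                    # stand-in for an I3D model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)

for step in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 1024)).logsumexp(-1).mean()  # dummy training loss
    loss.backward()
    optimizer.step()
    val_loss = loss.item()                            # use a real validation loss in practice
    scheduler.step(val_loss)                          # reduce LR when val_loss plateaus
```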
During training, data augmentation uses random cropping both spatially (resizing the smaller video side
to 256 pixels, then randomly cropping a 224 × 224 patch) and temporally (picking a random starting
frame), plus random left-right flipping and photometric augmentation.
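A sketch of the spatial part of that augmentation, assuming torchvision transforms applied to per-frame
tensors (the released code is TensorFlow, so this is only an illustration).

```python
# Resize the smaller side to 256, take a random 224x224 crop, random horizontal flip.
import torch
import torchvision.transforms as T

augment = T.Compose([
    T.Resize(256),                  # smaller video side -> 256 pixels
    T.RandomCrop(224),              # random 224 x 224 spatial patch
    T.RandomHorizontalFlip(p=0.5),  # random left-right flip
])

frame = torch.rand(3, 480, 640)     # one RGB frame in [0, 1]
# In practice the same crop/flip parameters should be shared by all frames of a clip.
print(augment(frame).shape)         # torch.Size([3, 224, 224])
```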
During test time the models are applied convolutionally over the whole video taking 224 × 224 center
crops, and the predictions are averaged. We computed optical flow with a TV-L1 algorithm
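A sketch of computing TV-L1 optical flow between consecutive frames with OpenCV's contrib module
(opencv-contrib-python); the grayscale conversion and the clipping of displacements to [-20, 20] are
common two-stream conventions assumed here, not details stated above.

```python
# Dense TV-L1 optical flow between two frames (requires opencv-contrib-python).
import cv2
import numpy as np

tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

def flow_pair(prev_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = tvl1.calc(prev_gray, next_gray, None)   # (H, W, 2) float32 displacements
    return np.clip(flow, -20.0, 20.0)              # truncate large motions (assumption)
```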
Experimental Comparison of Architectures
Comparison between models trained only on
UCF-101/HMDB-51 and models that use
Kinetics pre-training, showing the performance
difference for each architecture.
Comparison with the SOTA and Next
Comparison with state-of-the-art on the UCF-101
and HMDB-51 datasets, averaged over three
splits. First set of rows contains results of models
trained without labeled external data.
As future work, the authors mention applying the
approach to other video tasks such as semantic video
segmentation, video object detection, or optical
flow computation; they have not yet employed action
tubes or attention mechanisms to focus on the
human actors; and they plan to repeat all experiments
using Kinetics instead of miniKinetics, with and
without ImageNet pre-training, and to explore
inflating other state-of-the-art 2D ConvNets.
Thanks
Any Questions?
You can send mail to
Susang Kim(healess1@gmail.com)
