CVPR 2016: A Personal Summary

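I attended CVPR 2016, held from 6/27 to 7/1, so I put together my own summary. It is written fairly loosely, so I hope it simply gives you a sense of the kinds of research that were presented.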

Published in: Science

  1. CVPR 2016: A Personal Summary. Machine Perception & Robotics Group, Hiroshi Fukui
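  2. About me • Name: Hiroshi Fukui – Affiliation: Chubu University, Machine Perception and Robotics Group (Fujiyoshi Laboratory) – Twitter ID: @Catechine0125 – HP: https://sites.google.com/site/fhiroresearch/home • Main research topics: pedestrian detection, pedestrian attribute recognition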
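  3. Participation report • Attended CVPR 2016 from 6/27 to 7/1 – I was a co-author on a paper, so I tagged along with Prof. Takayoshi Yamashita – Attended the IEEE Intelligent Vehicle from 6/19 to 6/22 (not summarized here) [Figure: travel timeline running from 6/18 to 7/3, with trips marked at 6/19 – 6/23 and 6/27 – 7/1; note at 6/24 – 6/26: "for some reason, I returned to Japan once in between"]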
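  4. Participation report: CVPR 2016 • What is CVPR (Computer Vision and Pattern Recognition)? – A top conference in the field of image recognition – Covers a wide variety of areas, including image recognition, low-level processing, and 3D image recognition • Venue: Las Vegas, Nevada, USA • Dates: 6/26 to 7/1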
  5. Overview of this year's CVPR • Accept rate: 29.9% (643 / 2145) – Long oral: 3.9% (83 papers) – Short oral: 9.7% (123 papers) • CVPR 2016 program: main conference (4 days), tutorials (1 day), workshops (1 day)
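The percentages on the slide above can be re-derived from the counts it gives. The sketch below is only a sanity check. It assumes every rate is taken relative to the 2145 submissions, which the slide does not state, and the variable names are illustrative.

```python
# Counts as stated on the slide.
submissions = 2145
accepted = 643
long_orals = 83
short_orals = 123


def percent(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Assumption: all rates use the total number of submissions as denominator.
accept_rate = percent(accepted, submissions)
long_oral_rate = percent(long_orals, submissions)
short_oral_rate = percent(short_orals, submissions)
all_oral_rate = percent(long_orals + short_orals, submissions)

for name, value in [
    ("accept", accept_rate),
    ("long oral", long_oral_rate),
    ("short oral", short_oral_rate),
    ("long + short oral", all_oral_rate),
]:
    print(f"{name}: {value:.2f}%")
```

Under that assumption the accept and long-oral figures agree with the slide to within 0.1 percentage points, while 123 / 2145 comes to roughly 5.7%. The slide's 9.7% is closer to the combined share of long and short orals (roughly 9.6%), so it may be a cumulative figure; that reading is a guess and is not confirmed by the slide.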
  6. Paper trends [Figure: word cloud of terms from the papers, beginning with Learning, Deep, Image, Object, Detection, Using, Networks, Recognition, Convolutional, Neural, Video]
  7. Paper trends [Figure: word cloud of terms from the papers, beginning with Learning, Deep, Image, Object, Detection, Using, Networks, Recognition, Convolutional, Neural, Video] • Recognition, Video, Low-level, …
  8. Paper trends [Figure: word cloud of terms from the papers, beginning with Learning, Deep, Image, Object, Detection] • Most presentations in the image recognition field used deep learning and CNNs
  9. Paper trends [Figure: word cloud of terms from the papers, beginning with Learning, Deep, Image, Object, Detection] • More research related to human pose estimation and re-identification • Following the proposal of Fast / Faster R-CNN, many studies on object detection (recognition) were presented • For segmentation, many combinations of CNNs with MRFs/CRFs were proposed
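  10. Object detection or recognition • Mostly high-accuracy object detection built on Fast / Faster R-CNN – Used frequently because it is both fast and accurate • Improving detection performance by improving the network itself – Use of transfer learning – Some methods are designed to recognize objects faster (YOLO) – Methods that raise performance by stacking many layers were also proposed • Devising the input data – Combination with depth information or point cloud data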
  11. You Only Look Once: Unified, Real-Time Object Detection • Makes the search for object regions efficient by making it grid-based – Uses a GoogLeNet-based CNN that outputs bounding boxes and scores – Runs in real time even on a laptop PC
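To make the grid-based search on the slide above concrete, here is a minimal sketch of how a YOLO-style grid output could be decoded into detections. The tensor layout (per cell: B boxes of x, y, w, h, confidence, followed by C class probabilities), the 7 x 7 grid with 2 boxes and 20 classes, the threshold, and the function name `decode_grid` are assumptions made for illustration; the slide does not specify them.

```python
import numpy as np


def decode_grid(pred, num_boxes=2, score_thresh=0.5):
    """Turn an S x S x (B*5 + C) grid prediction into a list of detections.

    Assumed layout per cell (illustrative, not taken from the slide):
    B boxes of (x, y, w, h, confidence) followed by C class probabilities,
    with x, y relative to the cell and w, h relative to the whole image.
    """
    S = pred.shape[0]
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[num_boxes * 5:]
            cls = int(np.argmax(class_probs))
            for b in range(num_boxes):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                score = conf * class_probs[cls]
                if score < score_thresh:
                    continue
                # Convert the cell-relative centre to image-relative coordinates.
                cx = (col + x) / S
                cy = (row + y) / S
                detections.append((cx, cy, w, h, cls, float(score)))
    return detections


# Toy input: one confident box in cell (row 3, col 4) of a 7 x 7 grid, 20 classes.
pred = np.zeros((7, 7, 2 * 5 + 20))
pred[3, 4, 0:5] = [0.5, 0.5, 0.2, 0.3, 0.9]
pred[3, 4, 10 + 7] = 1.0
boxes = decode_grid(pred)
print(boxes)
```

The point of the sketch is that every cell is visited exactly once on a single network output, with no sliding window and no separate region proposal step. Practical detectors typically follow this with non-maximum suppression, which is omitted here.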
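  12. Deep Residual Learning for Image Recognition (Best paper) • ResNet: a very deep CNN – Simply stacking layers causes error gradients to explode or vanish → the layers are instead trained to fit the residual between output and input – Best performance at 152 layers on ImageNet and at 110 layers on CIFAR-10 – Can also be combined with Fast / Faster R-CNN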
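The residual idea on the slide above can be sketched in a few lines: a block computes a function F(x) and adds its input back through a shortcut, so the layers only need to model the difference from the identity. The version below is a simplification chosen for brevity, not the paper's architecture: it uses fully connected layers instead of convolutions, omits batch normalization, and the names `residual_block`, `w1`, and `w2` are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between."""
    residual = relu(x @ w1) @ w2   # F(x): the part the block has to learn
    return relu(residual + x)      # the shortcut adds the input back


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = np.zeros((8, 8))              # with F(x) = 0 the block reduces to relu(x)
y = residual_block(x, w1, w2)
print(np.allclose(y, relu(x)))
```

With the second weight matrix set to zero the block passes its input straight through (up to the final ReLU), which illustrates why adding more such blocks does not have to make optimization harder.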
  13. Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images • Proposes a network model for 3D object recognition using RGB images and point cloud data – 3D Amodal Region Proposal Network: a network that detects object candidates – Joint Object Recognition Network: estimates object positions in 3D space and recognizes the objects [Figure: 3D Amodal Region Proposal Network and Joint Object Recognition Network]
  14. Dense Human Body Correspondences Using Convolutional Networks • Matching between human bodies in 3D and non-rigid registration using a CNN – Feature extraction and correspondence matching using three kinds of 3D human body datasets • High accuracy and real-time processing
      [Figure 2 of the paper: We train a neural network which extracts a feature descriptor and predicts the corresponding segmentation label on the human body surface for each point in the input depth maps. We generate per-vertex descriptors for 3D models by averaging the feature descriptors in their rendered depth maps. We use the extracted features to compute dense correspondences.]
      [Table 1 of the paper: The end-to-end network architecture generates a per-pixel feature descriptor and a classification label for all pixels in a depth map simultaneously. From top to bottom in column: the filter size and the stride, the number of filters, the type of the activation function, the size of the image after filtering and the number of copies reserved for up-sampling.]
                     0      1     2     3     4     5       6     7     8       9     10
      layer          image  conv  max   conv  max   2×conv  conv  max   2×conv  int   conv
      filter-stride  -      11-4  3-2   5-1   3-2   3-1     3-1   3-2   1-1     -     3-1
      channel        1      96    96    256   256   384     256   256   4096    4096  16
      activation     -      relu  lrn   relu  lrn   relu    relu  idn   relu    idn   relu
      size           512    128   64    64    32    32      32    16    16      128   512
      num            1      1     4     4     16    16      16    64    64      1     1
  15. Pose estimation: Convolutional Pose Machine • Pose estimation with CNNs arranged in a cascade – Outputs a likelihood map for each joint position • Each successive stage enlarges the region it attends to and outputs a more accurate likelihood map – Stage t receives the likelihood maps of stage t-1 together with the feature maps [Figure 2 of the paper: Architecture and receptive fields of CPMs. We show a convolutional architecture and receptive fields across layers for a CPM with any T stages. The pose machine [29] is shown in insets (a) and (b), and the corresponding convolutional networks are shown in insets (c) and (d). Insets (a) and (c) … Panels: (a) Stage 1, (b) Stage ≥ 2, (c) Stage 1, (d) Stage ≥ 2, (e) Effective Receptive Field, with sizes from 9×9 to 400×400. Legend: P = Pooling, C = Convolution; belief maps have size h'×w'×(P + 1).]
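The stage-wise data flow described on the slide above (stage t consumes the likelihood maps of stage t-1 plus feature maps) can be sketched as a simple loop. This is only an illustration of the wiring: each stage is replaced by a random per-pixel linear map (a 1×1 convolution), so the growing receptive field and the per-stage losses of the actual method are not modelled, and all sizes and names are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """Per-pixel linear map: (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)."""
    return x @ weight


def run_stages(features, stage1_w, later_ws):
    """Return the belief (likelihood) maps produced by every stage."""
    beliefs = [conv1x1(features, stage1_w)]        # stage 1 sees features only
    for w in later_ws:                             # stage t >= 2
        # Concatenate image features with the previous stage's belief maps.
        stage_input = np.concatenate([features, beliefs[-1]], axis=-1)
        beliefs.append(conv1x1(stage_input, w))
    return beliefs


# Hypothetical sizes: C feature channels, P joints plus one background map.
H, W, C, P = 46, 46, 32, 14
rng = np.random.default_rng(0)
features = rng.standard_normal((H, W, C))
stage1_w = rng.standard_normal((C, P + 1))
later_ws = [rng.standard_normal((C + P + 1, P + 1)) for _ in range(2)]
beliefs = run_stages(features, stage1_w, later_ws)

# Joint locations are read off as the argmax of the final belief maps.
final = beliefs[-1]
joints = [np.unravel_index(np.argmax(final[:, :, p]), (H, W)) for p in range(P)]
print(len(beliefs), final.shape, joints[0])
```

In a trained model each stage would be a small CNN with large kernels, which is what lets later stages use a wider context to refine the maps.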
  16. Paper trends [Figure: word cloud of terms from the papers, beginning with Learning, Deep, Image, Object, Detection] • With RNNs coming into use, research on video analysis (action recognition, tracking, etc.), caption generation, and Visual Question Answering advanced greatly
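  17. A Key Volume Mining Deep Framework for Action Recognition • Proposes a CNN+RNN model that concentrates its learning on the regions that are key to recognizing the actions of people in video – Recognizes actions by selecting key frames, using a CNN that performs 3D convolution (2D + time) together with stochastic out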
  18. DenseCap: Fully Convolutional Localization Networks for Dense Captioning • A network that generates a caption for each detected object region – Uses a Fully Convolutional Localization Network based on the Region Proposal Network
  19. Stacked Attention Networks for Image Question Answering • A network that introduces attention layers indicating where in the image to look for a given question – Stacking attention layers yields a more accurate estimate of the region to attend to
      [Excerpt from the paper abstract shown on the slide: SANs use semantic representation of a question as query to search for the regions in an image that are related to the answer. We argue that image question answering (QA) often requires multiple steps of reasoning. Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively. Experiments conducted on four image QA data sets demonstrate that the proposed SANs significantly outperform previous state-of-the-art approaches. The visualization of the attention layers illustrates the progress that the SAN locates the relevant visual clues that lead to the answer of the question layer-by-layer.]
      [Figure: (a) Stacked Attention Network for Image QA. Question: "What are sitting in the basket on a bicycle?" → CNN / LSTM → Query → Attention layer 1 → Attention layer 2 → Softmax → Answer: dogs. (b) Visualization of the learned multiple attention layers (original image, first attention layer, second attention layer). The stacked attention network first focuses on all referred concepts, e.g., bicycle, basket and objects in the basket (dogs) in the first attention layer and then further narrows down the focus in the second layer and finds out the answer dog.]
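The stacking of attention layers described above can be sketched as follows. Each layer scores the image regions against the current query, takes a softmax over the scores, and adds the attention-weighted region feature to the query to form a refined query for the next layer. This loosely follows the usual description of the method; bias terms are omitted, the weights are random rather than trained, and the sizes and names (`attention_layer`, `w_i`, `w_q`, `w_p`) are assumptions for illustration.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def attention_layer(regions, query, w_i, w_q, w_p):
    """One attention step: weight image regions by relevance to the query.

    regions: (m, d) feature vectors of m image regions
    query:   (d,)   question vector, or the refined query of the previous layer
    """
    hidden = np.tanh(regions @ w_i + query @ w_q)   # (m, k)
    weights = softmax(hidden @ w_p)                 # (m,), non-negative, sums to 1
    attended = weights @ regions                    # (d,) weighted sum of regions
    return attended + query, weights                # refined query, attention map


rng = np.random.default_rng(0)
m, d, k = 196, 16, 8                                # hypothetical sizes
regions = rng.standard_normal((m, d))
query = rng.standard_normal(d)
params = [
    (rng.standard_normal((d, k)), rng.standard_normal((d, k)), rng.standard_normal(k))
    for _ in range(2)                               # two stacked attention layers
]

u = query
for w_i, w_q, w_p in params:
    u, weights = attention_layer(regions, u, w_i, w_q, w_p)
print(u.shape, float(weights.sum()))
```

In the full model the final refined query would be fed to a softmax classifier over candidate answers; that step is left out here.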
  20. Other trends • Tackling unusual (or new) problem settings – Sound estimation from video, heart rate estimation from face images, Face2Face, etc. • Creation of datasets
  21. Sound estimation • Feeds video into a CNN+RNN and outputs a sound wave
  22. Heart rate estimation from face images • Extracts color features from local regions of the face – Applies Self-Adaptive Matrix Completion (SAMC) to the extracted color features • Estimates the heart rate from the power spectrum obtained with SAMC [Figure 2 of the paper: Overview of the proposed approach for HR estimation. During the first phase, we automatically detect a set of facial keypoints and … Panels: 1. Face Region Extraction (ROI extraction, ROI warping); 2. Feature Extraction (Region 1 to Region R); 3. Self-Adaptive Matrix Completion (observation matrix, low-rank matrix, prior mask, estimated mask); 4. Heart Rate Estimation (power spectral density estimation of the signal estimated using SAMC; magnitude vs. frequency in Hz, 0 to 6)]
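The last step on the slide above, reading a heart rate off a power spectrum, can be sketched with a plain FFT. This covers only that final step and does not implement SAMC. The plausible heart-rate band of 0.7 to 4 Hz, the simple periodogram, the synthetic test signal, and the function name `estimate_heart_rate` are assumptions for illustration.

```python
import numpy as np


def estimate_heart_rate(signal, fps, low_hz=0.7, high_hz=4.0):
    """Return the dominant frequency within [low_hz, high_hz] in beats per minute."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                  # remove the DC component
    power = np.abs(np.fft.rfft(signal)) ** 2         # simple periodogram
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= low_hz) & (freqs <= high_hz)    # assumed heart-rate band
    peak_hz = freqs[band][np.argmax(power[band])]
    return 60.0 * peak_hz


# Synthetic stand-in for the recovered signal: 1.2 Hz pulse plus noise, 20 s at 30 fps.
fps = 30.0
t = np.arange(600) / fps
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 1.2 * t) + 0.1 * rng.standard_normal(t.size)
bpm = estimate_heart_rate(signal, fps)
print(round(bpm, 1))
```

A 1.2 Hz peak corresponds to 72 beats per minute. With real video the input would be the temporal signal recovered from the facial color features rather than a synthetic sine.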
  23. Face2Face • Transfers the expression of the Source onto the Target – Reflects the Source's expression by compositing the alignments of the two faces
  24. Creating new datasets: Cityscapes Dataset • A larger dataset than previous segmentation datasets – Previous segmentation dataset: CamVid Dataset • Built from about 700 frames captured in 1 city – Scale of the Cityscapes Dataset • Built from about 25,000 frames captured in 50 cities (fine annotation: 5,000; coarse annotation: 20,000)
  25. Creating new datasets: SYNTHIA Dataset • A dataset of driving scenes created with CG – Segmentation data and depth data are also released
  26. Creating new datasets: WIDER FACE • A large-scale face detection & face attribute dataset
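  27. Summary • In image recognition, almost all methods use deep learning – With the arrival of RNNs, a large body of research appeared on video understanding, caption generation, and VQA • Active enough that two oral sessions were devoted to caption generation & VQA – My strong impression is that a great many studies used Fast / Faster R-CNN and RNNs – Some papers built large-scale datasets in order to improve deep learning performance – Methods other than deep learning (MRF, RF, SVM, …) are used as post-processing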
