13. Is the background doing the work in recent datasets?
• Two-stream CNNs also take RGB input
– In UCF101, HMDB51, etc., the background region is large compared to the human region
– High recognition accuracy is achieved using only spatial information from the RGB input
• The spatial stream alone of a two-stream CNN reaches just over 70% accuracy on UCF101
• Proposal of “Human Action Recognition without Human”
• (recognizing human actions without looking at the human)
Y. He, S. Shirakabe, Y. Satoh, H. Kataoka, “Human Action Recognition without Human”, in ECCV 2016 Workshop on Brave New Ideas for Motion Representations in Videos (BNMW). (Oral & Best Paper)
Y. He, S. Shirakabe, Y. Satoh, H. Kataoka, “Human Action Recognition without Human” (in Japanese), ViEW, 2016. (ViEW Young Researcher Encouragement Award)
15. w/ and w/o Human Setting
• With / Without human setting
– Without human setting: the central region of the frame is masked out in black
– With human setting: the inverse of the without human setting (a code sketch follows the schematic below)
[Figure] Masking schematic: I’(x, y) = f(x, y) * I(x, y), where the binary mask f covers the central 1/2 of the frame with 1/4 margins on each side. Without Human Setting: the central region is zeroed out; With Human Setting: the inverse mask keeps only the central region.
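A minimal sketch of the masking operation, assuming the central-half proportions from the schematic above (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def apply_human_mask(frame: np.ndarray, with_human: bool) -> np.ndarray:
    """Mask a video frame per the w/ and w/o human settings.

    Assumes the human occupies the central half of the frame
    (1/4 margins on each side), as in the schematic above.
    """
    h, w = frame.shape[:2]
    mask = np.zeros((h, w), dtype=frame.dtype)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = 1  # central region
    if not with_human:
        mask = 1 - mask  # "without human": black out the center instead
    # Broadcast the 2D mask over the color channels if present.
    return frame * (mask[..., None] if frame.ndim == 3 else mask)
```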
16. Experimental setup
– Baseline: Very deep two-stream CNN [Wang+, arXiv15]
– Two settings: without human and with human
17. Experimental results
• @UCF101
– UCF101 pre-trained model with very deep two-stream CNN
– With/Without Human Setting
25. Appearance-based Recognition (1)
• Multi-View CNN [Su+, ICCV15]
– Learns object appearance per viewpoint
– Features are integrated via View Pooling (VP; see the sketch below)
H. Su et al. “Multi-view Convolutional Neural Networks for 3D Shape Recognition”, in ICCV, 2015.
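View Pooling is an element-wise max over the per-view CNN features; a minimal sketch of the idea (array shapes and sizes are illustrative):

```python
import numpy as np

def view_pooling(view_features: np.ndarray) -> np.ndarray:
    """Element-wise max over per-view CNN descriptors (MVCNN view pooling).

    view_features: (num_views, feature_dim) array, one descriptor per
    rendered viewpoint. Returns a single (feature_dim,) shape descriptor.
    """
    return view_features.max(axis=0)

# e.g. 12 rendered views, 4096-d fc features
views = np.random.rand(12, 4096).astype(np.float32)
shape_descriptor = view_pooling(views)  # shape: (4096,)
```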
26. Appearance-based Recognition (2)
• RotationNet [Kanezaki+, arXiv16]
– Based on MV-CNN
– Joint learning estimates the rotation pose in addition to the object label
A. Kanezaki et al. “RotationNet: Joint Learning of Object Classification and Viewpoint Estimation using Unaligned 3D Object Dataset”, in arXiv pre-print 1603.06208, 2016.
27. Model-based Recognition (1)
• SHOT [Tombari+, ECCV10]
– Keypoint matching that combines unambiguity and uniqueness
– Histograms the normals of the 3D point cloud within a sphere around each keypoint (see the sketch below)
F. Tombari et al. “SHOT: Unique Signatures of Histograms for Local Surface Description”, in ECCV, 2010.
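A heavily simplified sketch of the core histogramming step (the actual SHOT descriptor also builds a local reference frame, partitions the spherical support into sectors, and interpolates between bins; all names here are illustrative):

```python
import numpy as np

def normal_histogram(points, normals, keypoint, keypoint_normal,
                     radius=0.05, n_bins=11):
    """Histogram cos(theta) between the keypoint normal and the normals
    of neighboring points inside a spherical support region."""
    # Select neighbors inside the spherical support.
    dist = np.linalg.norm(points - keypoint, axis=1)
    neighbor_normals = normals[dist < radius]
    # Bin the cosine of the angle to the keypoint normal.
    cos_theta = neighbor_normals @ keypoint_normal
    hist, _ = np.histogram(cos_theta, bins=n_bins, range=(-1.0, 1.0))
    # Normalize for robustness to point density.
    return hist / (np.linalg.norm(hist) + 1e-9)
```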
28. Model-based Recognition (2)
• 3D ShapeNets [Wu+, CVPR15]
– Volumetric representation from depth images, processed with 3D convolutions
– Selects the Next Best View when classification is ambiguous
Z. Wu et al. “3D ShapeNets: A Deep Representation for Volumetric Shape Modeling”, in CVPR, 2015.
29. Model-based Recognition (3)
• (Deep) Sliding Shapes [Song+, ECCV14/CVPR16]
– Feature extraction from depth / 3D point cloud space using 3D CAD models
– Classification by a CNN that integrates 2D/3D convolutions (Deep Sliding Shapes)
S. Song et al. “Sliding Shapes for 3D Object Detection in Depth Images”, in ECCV, 2014.
S. Song et al. “Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images”, in CVPR, 2016.
30. Model-based Recognition (4)
• OctNet [Riegler+, arXiv16]
– Introduces octrees for sparse/dense 3D convolution (see the toy example below)
– Label estimation and semantic segmentation via Conv/UnConv
G. Riegler, et al. “OctNet: Learning Deep 3D Representations at High Resolutions”, in arXiv pre-print 1611.05009, 2016.
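A toy illustration of why octrees suit sparse 3D data, subdividing only occupied octants (this is not OctNet's hybrid grid-octree structure; names are illustrative):

```python
import numpy as np

class OctreeNode:
    """Toy octree over a cubic occupancy grid: subdivide occupied cells only,
    so empty space costs a single node instead of dense voxels."""

    def __init__(self, voxels: np.ndarray, min_size: int = 4):
        self.size = voxels.shape[0]          # cube side length (power of two)
        self.occupied = bool(voxels.any())
        self.children = []
        half = self.size // 2
        if self.occupied and self.size > min_size:
            # Split into 8 octants and recurse only where needed.
            for z in (0, half):
                for y in (0, half):
                    for x in (0, half):
                        self.children.append(OctreeNode(
                            voxels[z:z+half, y:y+half, x:x+half], min_size))

# e.g. a 32^3 grid with one small occupied blob
grid = np.zeros((32, 32, 32), dtype=bool)
grid[4:8, 4:8, 4:8] = True
root = OctreeNode(grid)
```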
37. YFCC100M Dataset
• The largest Flickr dataset, provided by Yahoo!
– 100 million Flickr images released under Creative Commons licenses
– Images from around the world carry location metadata
B. Thomee et al. “YFCC100M: The New Data in Multimedia Research”, Communications of the ACM, 59(2), pp. 64-73, 2016.
38. Analysis of geo-tagged images
• 3D reconstruction
– Building Rome in a Day [Agarwal+, ICCV09]
– Reconstructing the World in Six Days [Heinly+, CVPR15]
• Scene recognition
– City Perception [Zhou+, ECCV14]
39. Building Rome in a Day
• “Building Rome in a single day”
– 3D reconstruction of large-scale spaces
– An outstanding way to present the application
https://www.youtube.com/watch?v=kxtQqYLRaSQ
S. Agarwal et al. “Building Rome in a Day”, in ICCV, 2009.
40. Reconstructing the World in Six Days
• “Reconstructing the world in six days”
– Uses the YFCC100M Dataset
– A good example of what 100 million geo-tagged images make possible
J. Heinly et al. “Reconstructing the World in Six Days”, in CVPR, 2015.
41. City Perception
• Analyzes and visualizes the characteristics of 21 cities worldwide
– City analysis based on scene recognition
– Aggregates seven representative attributes drawn from sociological findings, e.g. buildings, water, greenery, traffic
B. Zhou et al. “Attribute Analysis of Geo-tagged Images for City Perception” in ECCV, 2014.
42. City Perception
• Geo-tagged images from social media + scene recognition
– Visualization of locations and recognition results (left figure)
• Visualizes the state of cities around the world
– Similarity analysis across the 21 cities (right figure)
• Shows that cities within the same continent resemble each other
B. Zhou et al. “Attribute Analysis of Geo-tagged Images for City Perception” in ECCV, 2014.
49. YFCC100M Dataset (revisited)
• The largest Flickr dataset, provided by Yahoo!
– 100 million Flickr images released under Creative Commons licenses
– Images from around the world carry location metadata
– Collected automatically from social media
B. Thomee et al. “YFCC100M: The New Data in Multimedia Research”, Communications of the ACM, 59(2), pp. 64-73, 2016.
63. Activity prediction via affordances
• Robotic support for daily-living activities
– Recognition of activities and objects
– Affordances link activities to objects
https://www.youtube.com/watch?v=dZyp41qBZBE
H. Koppula et al. “Anticipating Human Activities using Object Affordances for Reactive Robotic Response” in RSS, 2013.
64. Activity prediction using background knowledge
• Activity prediction incorporating context
– Activity transitions are input as a sequence
– Additional information (e.g. time of day) is also taken into account (see the sketch below)
[Figure] Graphical model over the time series: the observed variables x_timezone (“Daytime”), x_previous (“Walking”), and x_current (“Sitting”) are given, while the next activity θ (here “Using a PC”) is not given and must be inferred.
H. Kataoka, et al. “Activity Prediction using a Space-Time CNN and Bayesian Framework”, in VISAPP, 2016.
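A minimal sketch in the spirit of this setup: choose the next activity that maximizes P(next | timezone, previous, current) under a naive-Bayes factorization. All probability tables and activity names below are illustrative, not values from the paper.

```python
ACTIVITIES = ["using_pc", "eating", "walking"]

prior = {"using_pc": 0.4, "eating": 0.3, "walking": 0.3}
p_timezone = {  # P(timezone | next activity), illustrative values
    "using_pc": {"daytime": 0.7, "night": 0.3},
    "eating":   {"daytime": 0.5, "night": 0.5},
    "walking":  {"daytime": 0.8, "night": 0.2},
}
p_previous = {  # P(previous activity | next activity)
    "using_pc": {"walking": 0.6, "eating": 0.4},
    "eating":   {"walking": 0.5, "eating": 0.5},
    "walking":  {"walking": 0.3, "eating": 0.7},
}
p_current = {  # P(current activity | next activity)
    "using_pc": {"sitting": 0.9, "standing": 0.1},
    "eating":   {"sitting": 0.7, "standing": 0.3},
    "walking":  {"sitting": 0.2, "standing": 0.8},
}

def predict_next(timezone: str, previous: str, current: str) -> str:
    """argmax_a P(a) * P(timezone|a) * P(previous|a) * P(current|a)."""
    scores = {a: prior[a] * p_timezone[a][timezone]
                 * p_previous[a][previous] * p_current[a][current]
              for a in ACTIVITIES}
    return max(scores, key=scores.get)

print(predict_next("daytime", "walking", "sitting"))  # -> "using_pc"
```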
65. Transitional Action Recognition
• A transitional action (TA) is inserted between two actions
– The TA contains cues for prediction: recognition happens earlier in time than in early action recognition
– Recognizing the TA amounts to predicting the next action: more stable than action prediction
[Figure] Timeline t1...t12: “Walk straight” (action) passes through “Walk straight – Cross” (transitional action) into “Cross” (action). Proposal: short-term action prediction recognizes “cross” at time t5 via the TA, Δt earlier than previous work on early action recognition, which recognizes “cross” at time t9.
H. Kataoka et al. “Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature”, in BMVC, 2016.
66. Transitional Action Recognition
• A transitional action (TA) is inserted between two actions
– The TA contains cues for prediction: recognition happens earlier in time than in early action recognition
– Recognizing the TA amounts to predicting the next action: more stable than action prediction
Method / formulation (F^A: action features, F^TA: transitional-action features, A_t: action label at frame t, L: temporal offset):
– Action recognition: f(F^A_{1...t}) → A_t
– Early action recognition: f(F^A_{1...t−L}) → A_t
– Action prediction: f(F^A_{1...t}) → A_{t+L}
– Transitional action recognition: f(F^TA_{1...t}) → A_{t+L}
72. Reinforcement learning: Picking Robot (Google Brain)
• Multiple robots cooperatively learn a picking task
– They collect their own training data and judge success/failure in the real world
C. Finn et al. “Unsupervised Learning for Physical Interaction through Video Prediction”, in NIPS, 2016.
85. Does history repeat itself?
[Figure] Timeline of the 1st, 2nd, and 3rd AI eras, annotated with the works cited below
F. Rosenblatt et al. “Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms” in 1961.
J. F. Canny “Finding Edges and Lines in Images” in 1983.
Rumelhart et al. “Learning representations by back-propagating errors” in Nature 1986.
K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”, in 1980.
Y. LeCun et al. “Gradient-based learning applied to document recognition” in IEEE 1998.
J. H. Holland “Adaptation in Natural and Artificial Systems” in MIT Press 1975.
L. J. Eshelman, “The CHC Adaptive Search Algorithm: How to Have Safe Search When Engaging in Nontraditional Genetic Recombination”, in Foundations of Genetic Algorithms, 1991.
G. Huang et al. “Deep Networks with Stochastic Depth," in arXiv pre-print, 2016.
T. Yamasaki et al. “Efficient Optimization of Convolutional Neural Networks Using Particle Swarm Optimization," in MIRU, 2016.
89. Current approaches
– Time series: Dense Trajectories (DT), Improved DT (IDT)
– 3D features: Clouds of Oriented Gradients (COG)
Z. Ren et al. “Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients”, in CVPR, 2016.
H. Wang et al. “Action Recognition by Dense Trajectories”, in CVPR, 2011.
90. Training by AI
– A sense of a new kind of intelligence (AlphaGo)
D. Silver et al. “Mastering the game of Go with deep neural networks and tree search”, in Nature, 2016.
92. CNN/RNN parameters
Transferring them to features for volumetric data (e.g. video, 3D)
Z. Ren et al. “Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients”, in CVPR, 2016.
H. Wang et al. “Action Recognition by Dense Trajectories”, in CVPR, 2011.
93. Post-CNN, RNN
• Cooperation between deep learning and hand-crafted features
– Deep learning is heading toward being analyzed and understood
– Understand the mechanisms and integrate them into something better
• A mechanism for transferring CNN/RNN parameters is needed
96. Redefining the pixel
Is the RGB framework good enough? How should a pixel be represented?
H. P. Gage, “Optic projection, principles, installation, and use of the magic lantern, projection microscope, reflecting lantern, moving picture machine”, in 1914.
103. CV => CG [Iizuka+, SIGGRAPH16] Colorization
Innovative Technologies 2016 Special Prize “Culture”
S. Iizuka et al. “Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification”, in ACM TOG, 2016.
104. CV => ITS [Schneider+, IV16] Semantic Stixels (IV16 Best Paper)
L. Schneider et al. “Semantic Stixels: Depth is Not Enough”, in IEEE IV, 2016.
105. Audio => CV (CVPR16 oral, ECCV16)
A. Owens et al. “Visually Indicated Sounds”, in CVPR, 2016.
D.-A. Huang et al. “Connectionist Temporal Modeling for Weakly Supervised Action Labeling”, in ECCV, 2016.
106. Natural Language Processing => CV (CVPR16 oral)
J. Johnson et al. “DenseCap: Fully Convolutional Localization Networks for Dense Captioning” in CVPR, 2016.