http://www.hirokatsukataoka.net/pdf/visapp16_kataoka_prediction.pdf
We present a technique to address the new challenge of activity prediction in the field of computer vision. In activity prediction, we infer the next human activity from classified activities and an analysis of accumulated activity data. Moreover, the prediction must run in real time so that dangerous or anomalous activities can be averted. The combination of space-time convolutional neural networks (ST-CNN) and improved dense trajectories (IDT) effectively captures human activities in image sequences. After categorizing human activities, we insert activity tags into an activity database in order to sample the distribution of human activity. A naive Bayes classifier allows us to achieve real-time activity prediction, because only three elements are needed for parameter estimation. The contributions of this paper are: (i) activity prediction within a Bayesian framework and (ii) ST-CNN and IDT features for activity recognition. Human activity prediction in real scenes is achieved with 81.0% accuracy.
[VISAPP2016] Activity Prediction Using a Space-Time CNN and Bayesian Framework
1. Activity Prediction
Using a Space-Time CNN and Bayesian Framework
Hirokatsu KATAOKA, Yoshimitsu AOKI†, Kenji IWATA, Yutaka SATOH
National Institute of Advanced Industrial Science and Technology (AIST)
† Keio University
http://www.hirokatsukataoka.net/
2. Background
• Computer vision for human sensing
– Detection, tracking, trajectory analysis
– Posture estimation, action analysis
– Action recognition can extend human-sensing applications
(Figure: human-sensing stack: detection, tracking, trajectory extraction, posture estimation, face recognition, gaze estimation, and action recognition/analysis, inferring attention, body situation, and mental state, e.g. shaking hands, looking at people)
3. Related work 1: Action Recognition
• Action is a low-level primitive with semantic meaning
– e.g. walking, running, sitting
(Figure: action recognition as classification: given a localized person, the image of a man walking is labeled "Walking")
4. Is action recognition enough?
(Figure: time-series contrast: event detection assigns an action tag Ai after the event occurs (post-detection), whereas event prediction estimates a tag Aj before it occurs (pre-estimation))
5. Related work 2: Early Action Recognition
• Prediction in early part of action
– Integral bag-of-words
– Accumulating likelihood through time-sequence
M. S. Ryoo, "Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos", International Conference on Computer Vision (ICCV), pp. 1036-1043, 2011.
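The integral bag-of-words idea above can be sketched as cumulative histograms of visual words, so the likelihood of a partial observation is available at any frame without re-counting. This is a minimal illustration of the principle, not Ryoo's exact formulation:

```python
import numpy as np

def integral_bow(word_ids, vocab_size):
    """Cumulative bag-of-words: word_ids[t] lists the codewords
    detected at frame t; row t+1 of the result is the histogram
    of frames 1..t+1."""
    T = len(word_ids)
    integral = np.zeros((T + 1, vocab_size))
    for t, words in enumerate(word_ids):
        integral[t + 1] = integral[t]          # carry previous counts
        for w in words:
            integral[t + 1, w] += 1            # add this frame's words
    return integral

def histogram_up_to(integral, t):
    """BoW histogram of the partial observation, frames 1..t, in O(1)."""
    return integral[t]
```

With the cumulative table in hand, an early-recognition classifier can score the partial histogram at every incoming frame and accumulate likelihood through the sequence.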
6. Proposal
• Action prediction within an ST-CNN and Bayesian framework
– Action recognition
– Database analysis
(Figure: graphical model along the time series: observed variables x_timezone = Daytime, x_previous = Walking, x_current = Sitting are given; the next action θ = "Using a PC" is not given and must be inferred)
7. Problem settings
• Three different tasks in action analysis
– Action recognition
• Recognizing A_t given frames 1 ~ t
– Early action recognition
• Recognizing A_t given frames 1 ~ t-L
– Action prediction
• Recognizing A_{t+L} given frames 1 ~ t
Approach | Setting
Action recognition | f(F^A_{1...t}) → A_t
Early action recognition | f(F^A_{1...t-L}) → A_t
Action prediction | f(F^A_{1...t}) → A_{t+L}
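The three settings differ only in which frames the classifier sees and which label it must output. A minimal sketch, with `f` standing in for any frame-sequence classifier:

```python
def recognize(f, frames, t):
    """Action recognition: label A_t from the full observation, frames 1..t."""
    return f(frames[:t])

def recognize_early(f, frames, t, L):
    """Early action recognition: label A_t from the partial
    observation, frames 1..t-L (before the action completes)."""
    return f(frames[:t - L])

def predict(f, frames, t, L):
    """Action prediction: infer the *future* label A_{t+L}
    from frames 1..t; no frame of the next action is observed."""
    return f(frames[:t])
```

Recognition and prediction see the same input; what changes is the target label, which is why prediction needs the extra database model described later in the deck.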
8. Process flow
• Consists of (i) action recognition and (ii) action prediction
1. Action recognition
1.1 Improved dense trajectories (IDT)
1.2 Space-time convolutional neural networks (ST-CNN)
2. Action prediction
2.1 Bayesian framework
2.2 Database
(Figure: process flow: pedestrian detection feeds two streams; IDT: trajectories over t + L frames, feature extraction (HOG, HOF, MBH, Traj.), bag-of-words (BoW); ST-CNN: Oxford VGG architecture (VGGNet) with conv/pool stacks and FC layers)
9. Action Recognition (1/2)
• Improved Dense Trajectories (IDT) [Wang+, ICCV2013]
– Pyramidal image sequences and flow tracking
– Feature descriptors on trajectories
– Feature representation with bag-of-words (BoW)
(Figure: IDT trajectories on "walking" and "sitting" sequences)
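The BoW representation on the slide quantizes each trajectory descriptor (HOG, HOF, MBH, Traj.) against a learned codebook and keeps only the histogram of codeword counts. A minimal sketch, assuming the codebook has already been built (normally by k-means over training descriptors):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each descriptor (row) to its nearest codeword and
    return the L1-normalized histogram used as the video-level feature.
    descriptors: (N, D) array; codebook: (K, D) array."""
    # squared Euclidean distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                       # nearest codeword index
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)              # normalize to sum to 1
```

In the paper's pipeline a histogram like this, pooled over all trajectories in a clip, is what the action classifier consumes.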
10. Action Recognition (1/2)
• IDT + Co-occurrence HOG [Kataoka+, ACCV2014]
– CoHOG: edge-pair counting into the corresponding histogram position
– Extended CoHOG (ECoHOG): edge-magnitude accumulation
– PCA dim. reduction: 10^3 to 10^4 dims into 10^1 to 10^2, easier to separate in feature space
11. Action Recognition (2/2)
• Space-time Convolutional Neural Networks (ST-CNN)
– Based on VGG 16-layer architecture (VGGNet) [Simonyan+, ICLR2015]
– Spatio-temporal feature concatenation (around 10 frames)
(Figure: space-time CNN (ST-CNN) feature: VGGNet conv/pool stacks followed by FC layers and a softmax; the input is a spatio-temporal concatenation of around 10 frames)
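The spatio-temporal concatenation can be sketched as stacking a short window of consecutive frames along the channel axis before feeding a VGG-style network. The channel layout here is an assumption for illustration; the paper's exact concatenation scheme may differ:

```python
import numpy as np

def stack_frames(frames, t, window=10):
    """Build the space-time input block ending at frame t.
    frames: (T, H, W, C) array of consecutive video frames.
    Returns an (H, W, C * window) array with the last `window`
    frames concatenated along the channel axis."""
    clip = frames[t - window + 1 : t + 1]           # (window, H, W, C)
    return np.concatenate(list(clip), axis=-1)       # channel-stacked block
```

A standard 2-D convolutional stack applied to this block sees short-range motion as extra input channels, which is what lets a VGG-style architecture act as a space-time feature extractor.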
12. Action Prediction (1/2)
• Prediction model
– Action sequence: predicting "Using a PC" from "Walk" => "Sit"
– Time zone (supplemental info.): e.g. Daytime
(Figure: same graphical model as the Proposal slide: x_timezone = Daytime, x_previous = Walking, x_current = Sitting given; the next activity θ = "Using a PC" is inferred)
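The model above is a naive Bayes classifier over the three observed variables (time zone, previous action, current action), assumed conditionally independent given the next action θ. A minimal sketch; the probability tables here are illustrative, not values from the paper:

```python
def predict_next(prior, cond, x):
    """Score each candidate next action theta by
    P(theta) * prod_i P(x_i | theta) and return the argmax.
    prior[theta] = P(theta);
    cond[i][theta][x_i] = P(x_i | theta) for variable i;
    x = (timezone, previous_action, current_action)."""
    scores = {}
    for theta, p in prior.items():
        s = p
        for i, xi in enumerate(x):
            s *= cond[i][theta].get(xi, 1e-6)   # small floor for unseen values
        scores[theta] = s
    return max(scores, key=scores.get)
```

Because only three conditional tables and a prior are needed, parameter estimation and inference are cheap enough for the real-time prediction the abstract claims.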
13. Action Prediction (2/2)
• Database: ST-action tags + attribute
– Time zone
• "morning", "day time", "night"
– Previous & current action
• "walk", "bend", "stand", "sit"…
– Next action (objective)
• "use a PC", "read", "meal"…
(Figure: Action History DB example record: Daytime; Walking → Sitting → Using a PC)
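The naive-Bayes tables can be estimated directly from the action-history database by counting records of the form (time zone, previous action, current action, next action). A sketch assuming Laplace smoothing, which is an added assumption rather than something the slides specify:

```python
from collections import Counter

def estimate(records, alpha=1.0):
    """Estimate prior P(theta) and conditionals P(x_i | theta) by
    counting database records (timezone, previous, current, next)."""
    thetas = sorted({r[3] for r in records})
    prior = {t: (sum(r[3] == t for r in records) + alpha) /
                (len(records) + alpha * len(thetas)) for t in thetas}
    cond = []
    for i in range(3):                       # one table per observed variable
        values = sorted({r[i] for r in records})
        table = {}
        for t in thetas:
            c = Counter(r[i] for r in records if r[3] == t)
            n = sum(c.values())
            table[t] = {v: (c[v] + alpha) / (n + alpha * len(values))
                        for v in values}
        cond.append(table)
    return prior, cond
```

This counting step is why inserting activity tags into the database is enough to "sample a distribution of human activity": every new record updates the counts the classifier runs on.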
14. Experiments on the Daily Living Data
– Total 20h of video
– 3 different scenes
– 640x480, 30fps