http://www.hirokatsukataoka.net/pdf/visapp16_kataoka_prediction.pdf
We present a technique to address the new challenge of activity prediction in the field of computer vision. In activity prediction, we infer the next human activity from classified activities and an analysis of accumulated activity data. Moreover, the prediction must run in real time so that dangerous or anomalous activities can be averted. The combination of space-time convolutional neural networks (ST-CNN) and improved dense trajectories (IDT) effectively captures human activities in image sequences. After categorizing human activities, we insert activity tags into an activity database in order to sample the distribution of human activity. A naive Bayes classifier allows us to achieve real-time activity prediction, because only three elements are needed for parameter estimation. The contributions of this paper are: (i) activity prediction within a Bayesian framework and (ii) ST-CNN and IDT features for activity recognition. Human activity prediction in real scenes is achieved with 81.0% accuracy.
[VISAPP2016] Activity Prediction Using a Space-Time CNN and Bayesian Framework
1. Activity Prediction
Using a Space-Time CNN and Bayesian Framework
Hirokatsu KATAOKA, Yoshimitsu AOKI†, Kenji IWATA, Yutaka SATOH
National Institute of Advanced Industrial Science and Technology (AIST)
† Keio University
http://www.hirokatsukataoka.net/
2. Background
• Computer vision for human sensing
– Detection, tracking, trajectory analysis
– Posture estimation, action analysis
– Action recognition can extend human-sensing applications
(Figure: human-sensing stack: detection, tracking, trajectory extraction, posture estimation, face recognition, gaze estimation, and action recognition/analysis, inferring attention, body situation, and mental state, e.g. shaking hands, looking at people)
3. Related work 1: Action Recognition
• Action is a low-level primitive with semantic meaning
– e.g. walking, running, sitting
(Figure: action recognition as classification: given a localized person, the image of a man walking is labeled "Walking")
4. Is action recognition enough?
(Figure: time-series contrast: event detection assigns an action tag Ai after the event occurs (post-detection), whereas event prediction estimates a tag Aj before it occurs (pre-estimation))
5. Related work 2: Early Action Recognition
• Prediction in early part of action
– Integral bag-of-words
– Accumulating likelihood through time-sequence
M. S. Ryoo, "Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos", International Conference on Computer Vision (ICCV), pp. 1036-1043, 2011.
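The integral bag-of-words idea above can be sketched as cumulative histograms of visual words, so the likelihood of a partial observation is available at any frame without re-counting. This is a minimal illustration of the principle, not Ryoo's exact formulation:

```python
import numpy as np

def integral_bow(word_ids, vocab_size):
    """Cumulative bag-of-words: word_ids[t] lists the codewords
    detected at frame t; row t+1 of the result is the histogram
    of frames 1..t+1."""
    T = len(word_ids)
    integral = np.zeros((T + 1, vocab_size))
    for t, words in enumerate(word_ids):
        integral[t + 1] = integral[t]          # carry previous counts
        for w in words:
            integral[t + 1, w] += 1            # add this frame's words
    return integral

def histogram_up_to(integral, t):
    """BoW histogram of the partial observation, frames 1..t, in O(1)."""
    return integral[t]
```

With the cumulative table in hand, an early-recognition classifier can score the partial histogram at every incoming frame and accumulate likelihood through the sequence.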
6. Proposal
• Action prediction within an ST-CNN and Bayesian framework
– Action recognition
– Database analysis
(Figure: graphical model along the time series: observed variables x_timezone = Daytime, x_previous = Walking, x_current = Sitting are given; the next action θ = "Using a PC" is not given and must be inferred)
7. Problem settings
• Three different tasks in action analysis
– Action recognition
• Recognizing A_t given frames 1 ~ t
– Early action recognition
• Recognizing A_t given frames 1 ~ t-L
– Action prediction
• Recognizing A_{t+L} given frames 1 ~ t
Approach | Setting
Action recognition | f(F^A_{1...t}) → A_t
Early action recognition | f(F^A_{1...t-L}) → A_t
Action prediction | f(F^A_{1...t}) → A_{t+L}
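The three settings differ only in which frames the classifier sees and which label it must output. A minimal sketch, with `f` standing in for any frame-sequence classifier:

```python
def recognize(f, frames, t):
    """Action recognition: label A_t from the full observation, frames 1..t."""
    return f(frames[:t])

def recognize_early(f, frames, t, L):
    """Early action recognition: label A_t from the partial
    observation, frames 1..t-L (before the action completes)."""
    return f(frames[:t - L])

def predict(f, frames, t, L):
    """Action prediction: infer the *future* label A_{t+L}
    from frames 1..t; no frame of the next action is observed."""
    return f(frames[:t])
```

Recognition and prediction see the same input; what changes is the target label, which is why prediction needs the extra database model described later in the deck.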
8. Process flow
• Consists of (i) action recognition and (ii) action prediction
1. Action recognition
1.1 Improved dense trajectories (IDT)
1.2 Space-time convolutional neural networks (ST-CNN)
2. Action prediction
2.1 Bayesian framework
2.2 Database
(Figure: process flow: pedestrian detection feeds two streams; IDT: trajectories over t + L frames, feature extraction (HOG, HOF, MBH, Traj.), bag-of-words (BoW); ST-CNN: Oxford VGG architecture (VGGNet) with conv/pool stacks and FC layers)
9. Action Recognition (1/2)
• Improved Dense Trajectories (IDT) [Wang+, ICCV2013]
– Pyramidal image sequences and flow tracking
– Feature descriptors on trajectories
– Feature representation with bag-of-words (BoW)
(Figure: IDT trajectories on "walking" and "sitting" sequences)
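The BoW representation on the slide quantizes each trajectory descriptor (HOG, HOF, MBH, Traj.) against a learned codebook and keeps only the histogram of codeword counts. A minimal sketch, assuming the codebook has already been built (normally by k-means over training descriptors):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each descriptor (row) to its nearest codeword and
    return the L1-normalized histogram used as the video-level feature.
    descriptors: (N, D) array; codebook: (K, D) array."""
    # squared Euclidean distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                       # nearest codeword index
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)              # normalize to sum to 1
```

In the paper's pipeline a histogram like this, pooled over all trajectories in a clip, is what the action classifier consumes.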
10. Action Recognition (1/2)
• IDT + Co-occurrence HOG [Kataoka+, ACCV2014]
– CoHOG: edge-pair counting into the corresponding histogram position
– Extended CoHOG (ECoHOG): edge-magnitude accumulation
– PCA dim. reduction: 10^3 to 10^4 dims into 10^1 to 10^2, easier to separate in feature space
11. Action Recognition (2/2)
• Space-time Convolutional Neural Networks (ST-CNN)
– Based on VGG 16-layer architecture (VGGNet) [Simonyan+, ICLR2015]
– Spatio-temporal feature concatenation (around 10 frames)
(Figure: space-time CNN (ST-CNN) feature: VGGNet conv/pool stacks followed by FC layers and a softmax; the input is a spatio-temporal concatenation of around 10 frames)
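The spatio-temporal concatenation can be sketched as stacking a short window of consecutive frames along the channel axis before feeding a VGG-style network. The channel layout here is an assumption for illustration; the paper's exact concatenation scheme may differ:

```python
import numpy as np

def stack_frames(frames, t, window=10):
    """Build the space-time input block ending at frame t.
    frames: (T, H, W, C) array of consecutive video frames.
    Returns an (H, W, C * window) array with the last `window`
    frames concatenated along the channel axis."""
    clip = frames[t - window + 1 : t + 1]           # (window, H, W, C)
    return np.concatenate(list(clip), axis=-1)       # channel-stacked block
```

A standard 2-D convolutional stack applied to this block sees short-range motion as extra input channels, which is what lets a VGG-style architecture act as a space-time feature extractor.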
12. Action Prediction (1/2)
• Prediction model
– Action sequence: predicting "Using a PC" from "Walk" => "Sit"
– Time zone (supplemental info.): e.g. Daytime
(Figure: same graphical model as the Proposal slide: x_timezone = Daytime, x_previous = Walking, x_current = Sitting given; the next activity θ = "Using a PC" is inferred)
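The model above is a naive Bayes classifier over the three observed variables (time zone, previous action, current action), assumed conditionally independent given the next action θ. A minimal sketch; the probability tables here are illustrative, not values from the paper:

```python
def predict_next(prior, cond, x):
    """Score each candidate next action theta by
    P(theta) * prod_i P(x_i | theta) and return the argmax.
    prior[theta] = P(theta);
    cond[i][theta][x_i] = P(x_i | theta) for variable i;
    x = (timezone, previous_action, current_action)."""
    scores = {}
    for theta, p in prior.items():
        s = p
        for i, xi in enumerate(x):
            s *= cond[i][theta].get(xi, 1e-6)   # small floor for unseen values
        scores[theta] = s
    return max(scores, key=scores.get)
```

Because only three conditional tables and a prior are needed, parameter estimation and inference are cheap enough for the real-time prediction the abstract claims.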
13. Action Prediction (2/2)
• Database: ST-action tags + attribute
– Time zone
• "morning", "day time", "night"
– Previous & current action
• "walk", "bend", "stand", "sit"…
– Next action (objective)
• "use a PC", "read", "meal"…
(Figure: Action History DB example record: Daytime; Walking → Sitting → Using a PC)
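The naive-Bayes tables can be estimated directly from the action-history database by counting records of the form (time zone, previous action, current action, next action). A sketch assuming Laplace smoothing, which is an added assumption rather than something the slides specify:

```python
from collections import Counter

def estimate(records, alpha=1.0):
    """Estimate prior P(theta) and conditionals P(x_i | theta) by
    counting database records (timezone, previous, current, next)."""
    thetas = sorted({r[3] for r in records})
    prior = {t: (sum(r[3] == t for r in records) + alpha) /
                (len(records) + alpha * len(thetas)) for t in thetas}
    cond = []
    for i in range(3):                       # one table per observed variable
        values = sorted({r[i] for r in records})
        table = {}
        for t in thetas:
            c = Counter(r[i] for r in records if r[3] == t)
            n = sum(c.values())
            table[t] = {v: (c[v] + alpha) / (n + alpha * len(values))
                        for v in values}
        cond.append(table)
    return prior, cond
```

This counting step is why inserting activity tags into the database is enough to "sample a distribution of human activity": every new record updates the counts the classifier runs on.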
14. Experiments on the Daily Living Data
– Total 20h of video
– 3 different scenes
– 640x480, 30fps