The document discusses continuous human action recognition in ambient assisted living scenarios. It proposes an approach that uses action zones, which are the most discriminative segments of an action, rather than whole action sequences. Action zones are learned from training data and used to recognize actions in continuous video streams using a sliding window approach. The recognition thresholds and parameters are optimized using an evolutionary algorithm to maximize recognition performance. The approach is validated on public datasets and aims to enable long-term continuous human behavior analysis in ambient assisted living environments.
Continuous human action recognition in ambient assisted living scenarios
1. Continuous Human Action Recognition in Ambient Assisted Living Scenarios
International Workshop on Enhanced Living Environments (ELEMENT 2014)
Würzburg, Germany
Alexandros Andre Chaaraoui
Francisco Flórez-Revuelta
2. Human action recognition in AAL
Human action recognition with a bag of key poses
Continuous human action recognition
Experimentation
Overview
3. Architecture for AAL
[Architecture diagram: Cameras 1 to N, each feeding a Motion Detection and a Human Behaviour Analysis module; the per-camera outputs are fused by a Multi-view Human Behaviour Analysis component. A Privacy module and a Reasoning System connect to Alarm Actuators and to the Caregiver. Supporting elements: a Setup and Profiles DB (activities, inhabitants, objects, ...), an Event Log, Long-term analysis, and Environmental Sensor Information.]
9. Original method
Chaaraoui, A.A.; Climent-Pérez, P.; Flórez-Revuelta, F.: Silhouette-based Human Action Recognition using Sequences of Key Poses, Pattern Recognition Letters, 34(15):1799–1807, 2013.
Multi-view action recognition
Chaaraoui, A.A.; Climent-Pérez, P.; Flórez-Revuelta, F.: An Efficient Approach for Multi-view Human Action Recognition Based on Bag-of-Key-Poses, Lecture Notes in Computer Science, 7559:29–40, 2012.
Evolutionary optimisation
Chaaraoui, A.A.; Flórez-Revuelta, F.: Optimizing human action recognition based on a cooperative coevolutionary algorithm, Engineering Applications of Artificial Intelligence, 31:116–125, 2014.
Incremental learning
Chaaraoui, A.A.; Flórez-Revuelta, F.: Adaptive Human Action Recognition With an Evolving Bag of Key Poses, IEEE Transactions on Autonomous Mental Development, 6(2):139–152, 2014.
Use of RGB-D data
Chaaraoui, A.A.; Padilla-López, J.R.; Climent-Pérez, P.; Flórez-Revuelta, F.: Evolutionary joint selection to improve human action recognition with RGB-D devices, Expert Systems with Applications, 41(3):786–794, 2014.
More information
10. The previous methods work with pre-segmented sequences
However, their accurate recognition and outstanding temporal performance led us to extend the approach to continuous scenarios
A sliding and growing window is used to process the continuous stream at different overlapping locations and scales
A null class is considered in order to discard unknown actions and avoid false positives. This class covers all the behaviours that may be observed but have not been modelled during learning
Continuous human action recognition is performed by detecting and classifying
action zones
Continuous recognition
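The sliding and growing window scheme can be sketched as follows. The default step values echo the experimentation slide (5-frame growth, 10-frame slide); the function name, the minimum length, and the maximum length are illustrative assumptions, not values from the talk:

```python
def window_positions(stream_len, min_len=5, max_len=60, grow=5, slide=10):
    """Enumerate (start, end) segments of a sliding and growing window.

    At each start position the window grows in `grow`-frame steps until
    `max_len` is reached; then the start slides forward by `slide` frames.
    This yields overlapping segments at several locations and scales.
    """
    positions = []
    start = 0
    while start < stream_len:
        length = min_len
        while length <= max_len and start + length <= stream_len:
            positions.append((start, start + length))
            length += grow
        start += slide
    return positions
```

Each returned segment would then be classified independently, so overlapping windows of different lengths all get a chance to match a learned action zone.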
11. Action sequences may contain irrelevant segments which are common among actions and therefore ambiguous for classification
Action zones = the most discriminative segments, with respect to the other action classes, in the course of an action
Action zones are shorter than the original sequences, so the matching time is significantly reduced
This allows a larger number of sliding windows to be considered at every moment
Action zones
14. The bag of key poses is built similarly to the original method
The discrimination value of each key pose w_kp is obtained:
For each training sequence of action class a and specific temporal instant t:
1. For each action class a, the nearest-neighbour key pose kp_a(t) is obtained
2. The raw class evidence values for all the classes are computed
3. Normalisation is applied with respect to the highest value observed
Learning of action zones
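Steps 1–3 can be sketched as below. The slide does not reproduce the exact evidence formula, so the inverse-distance evidence used here is an illustrative assumption, as are all names (`class_evidence`, `bag_of_key_poses`):

```python
import numpy as np

def class_evidence(frame_feature, bag_of_key_poses):
    """Raw class evidence for one frame.

    For each action class, find the nearest-neighbour key pose and turn
    its distance into an evidence value (closer key pose -> higher
    evidence).  `bag_of_key_poses` maps class label -> array of key-pose
    feature vectors, one row per key pose.
    """
    evidence = {}
    for label, key_poses in bag_of_key_poses.items():
        dists = np.linalg.norm(key_poses - frame_feature, axis=1)
        evidence[label] = 1.0 / (1.0 + dists.min())  # raw evidence (assumed form)
    return evidence

def normalise(evidence):
    """Normalise evidences with respect to the highest value observed."""
    top = max(evidence.values())
    return {label: v / top for label, v in evidence.items()}
```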
16. 4. Gaussian smoothing is performed, centred on the current frame, considering only the frames from a temporal instant u ≤ t
5. The final class evidence H(t) is obtained by attenuating the resulting value
Action zones are detected by defining thresholds H_T1(t), H_T2(t), …, H_TA(t)
So, for a sequence belonging to an action class, the action zone is determined by the frames where the class evidence exceeds the corresponding threshold
Learning of action zones
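The causal smoothing (step 4) and threshold-based zone extraction can be sketched as follows. The attenuation of step 5 is omitted because its formula is not reproduced on the slide; `sigma` and all function names are illustrative assumptions:

```python
import math

def smoothed_evidence(raw, t, sigma=5.0):
    """Causal Gaussian smoothing of a raw evidence series at frame t.

    Only frames u <= t contribute, weighted by a Gaussian centred at t,
    so the estimate never looks into the future.
    """
    num = den = 0.0
    for u in range(t + 1):
        w = math.exp(-((t - u) ** 2) / (2.0 * sigma ** 2))
        num += w * raw[u]
        den += w
    return num / den

def action_zone(evidence, threshold):
    """Frames whose class evidence reaches the class threshold form
    the action zone of the training sequence."""
    return [t for t, h in enumerate(evidence) if h >= threshold]
```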
17. The action zones for every learning sequence then constitute the knowledge base
A sliding and growing window is used to process the continuous stream at different overlapping locations and scales
These segments of key poses are compared with the learned action zones using DTW
In some cases, even the nearest key pose is very different from the input frame
Therefore, a set of threshold parameters DT1, DT2, …, DTA indicates the highest allowed distance to trigger a recognition
If the match is not good enough, the frame is labelled as the null class
Continuous recognition
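The DTW matching with distance-threshold rejection can be sketched as below. The DTW recurrence is the textbook one; `classify_segment`, the data layout, and the `"null"` label are illustrative assumptions:

```python
def dtw_distance(seq_a, seq_b, dist):
    """Classic dynamic-time-warping distance between two sequences."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def classify_segment(segment, zones_by_class, thresholds, dist):
    """Match a window segment against learned action zones with DTW.

    If even the best match exceeds the class threshold DT_a, the
    segment is rejected and assigned to the null class.
    """
    best_label, best_d = None, float("inf")
    for label, zones in zones_by_class.items():
        for zone in zones:
            d = dtw_distance(segment, zone, dist)
            if d < best_d:
                best_label, best_d = label, d
    if best_label is None or best_d > thresholds[best_label]:
        return "null"
    return best_label
```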
18. Two sets of parameters must be established: H_T1(t), H_T2(t), …, H_TA(t) and DT1, DT2, …, DTA
They are set with an evolutionary algorithm that finds the best-performing combination for both sets
Comparison with the ground truth is performed at segment level
An action must be recognised with a delay lower than τ frames
Experimentation
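A minimal sketch of threshold optimisation, using a toy (1+1) mutation-and-selection loop as a stand-in for the talk's evolutionary algorithm; `evaluate` would return the segment-level recognition score against the ground truth, and every name and mutation constant here is an assumption:

```python
import random

def optimise_thresholds(evaluate, n_classes, generations=200, seed=0):
    """Toy (1+1) evolutionary search over the two threshold sets:
    one HT value and one DT value per action class, maximising the
    recognition score returned by `evaluate(ht, dt)`."""
    rng = random.Random(seed)
    ht = [rng.random() for _ in range(n_classes)]
    dt = [rng.random() for _ in range(n_classes)]
    best = evaluate(ht, dt)
    for _ in range(generations):
        # Mutate both threshold sets with small Gaussian perturbations
        cand_ht = [max(0.0, h + rng.gauss(0, 0.05)) for h in ht]
        cand_dt = [max(0.0, d + rng.gauss(0, 0.05)) for d in dt]
        score = evaluate(cand_ht, cand_dt)
        if score >= best:  # keep the offspring if it is at least as good
            ht, dt, best = cand_ht, cand_dt, score
    return ht, dt, best
```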
19. Validation with the multi-view IXMAS and the single-view Weizmann datasets
The window grows in 5-frame steps and, when length_max is reached, it slides 10 frames
A delayed recognition is accepted for τ = 60 frames ≈ 2 seconds
Experimentation
NOTE:
Approach 1: Use of action zones
Approach 2: Use of the whole sequences