570 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 2, FEBRUARY 2014
Action Recognition Using Nonnegative Action
Component Representation and Sparse
Basis Selection
Haoran Wang, Chunfeng Yuan, Weiming Hu, Haibin Ling, Wankou Yang, and Changyin Sun
Abstract—In this paper, we propose using high-level action
units to represent human actions in videos and, based on
such units, a novel sparse model is developed for human
action recognition. There are three interconnected components
in our approach. First, we propose a new context-aware spatial-
temporal descriptor, named locally weighted word context, to
improve the discriminability of the traditionally used local
spatial-temporal descriptors. Second, from the statistics of the
context-aware descriptors, we learn action units using the graph
regularized nonnegative matrix factorization, which leads to a
part-based representation and encodes the geometrical informa-
tion. These units effectively bridge the semantic gap in action
recognition. Third, we propose a sparse model based on a joint
l2,1-norm to preserve the representative items and suppress noise
in the action units. Intuitively, when learning the dictionary for
action representation, the sparse model captures the fact that
actions from the same class share similar units. The proposed
approach is evaluated on several publicly available data sets.
The experimental results and analysis clearly demonstrate the
effectiveness of the proposed approach.
Index Terms—Action unit, action recognition, sparse repre-
sentation, nonnegative matrix factorization.
I. INTRODUCTION
HUMAN action recognition has a wide range of appli-
cations such as video content analysis, activity sur-
veillance, and human-computer interaction [1]. As one of
the most active topics in computer vision, much work on
human action recognition has been reported [2]–[37].
Manuscript received January 20, 2013; revised August 2, 2013 and
October 26, 2013; accepted November 12, 2013. Date of publication
November 25, 2013; date of current version December 17, 2013. This work
was supported in part by the Scientific Research Foundation of Graduate
School of Southeast University, in part by the Beijing Natural Science Foun-
dation under Grant 4121003, in part by the National 863 High-Tech Research
and Development Program of China under Grant 2012AA012504, and in part
by the Key Laboratory of Measurement and Control of Complex Systems
of Engineering, Ministry of Education. The associate editor coordinating the
review of this manuscript and approving it for publication was Dr. Dimitrios
Tzovaras.
H. Wang, W. Yang, and C. Sun are with the School of Automation, South-
east University, Nanjing 210096, China (e-mail: whr1fighting@gmail.com;
youngwankou@yeah.net; cysun@seu.edu.cn).
C. Yuan and W. Hu are with the National Laboratory of Pattern Recognition,
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
(e-mail: cfyuan@nlpr.ia.ac.cn; wmhu@nlpr.ia.ac.cn).
H. Ling is with the Department of Computer and Information Science,
Temple University, Philadelphia, PA 19122 USA (e-mail:
hbling@temple.edu).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2013.2292550
Fig. 1. A single interest point may have multiple meanings in different
contexts. Some correlated interest points together can construct an action
unit which is more descriptive and discriminative. A video sequence can
be represented by a few action units, and each action class has its own
representative action units.
In most of the traditional approaches for human action recognition, action models are typically constructed from patterns of low-
level features such as appearance patterns [27], [28], optical
flow [14], [15], space-time templates [16], [17], 2D shape
matching [18], [19], trajectory-based representation [22], [23]
and bag-of-visual-words (BoVW) [13], [34]. However, these
features can hardly characterize rich semantic structure in
actions.
Inspired by recent development in object classification
[38], [39], we introduce a high-level concept named “action
unit” to describe human actions, as illustrated in Fig. 1.
For example, the “golf-swinging” action contains some
representative motions, such as “arm swing” and “torso
twist”. Such motions can hardly be described by the low-level features mentioned above. On the other hand, some correlated
space-time interest points, when combined together, can
characterize a representative motion. Moreover, the key frame
is important to describe an action; and a key frame may
be characterized by the co-occurrence of space-time interest
points extracted from the frame. The representative motions
and key frames both reflect some action units, which can
then be used to represent action classes. With the above
observation, we propose using high-level action units for
human action representation.
Fig. 2. Flowchart of the proposed framework.
Typically, from an input human
action video, hundreds of interest points are first extracted
and then agglomerated into tens of action units, which then
compactly represent the video. Such a representation is more
discriminative than the traditional BoVW model. To use it for
action recognition, we address the following three main issues.
1. Selecting low-level features for generating the action
unit. Some of the aforementioned features require reliable
tracking or body pose estimation, which is hard to achieve
in practice. The interest-point-based representation circumvents such requirements while being robust to noise, occlusion, and geometric variation. But traditional bag-of-visual-words (BoVW) models utilize only features from individual interest points and ignore spatial-temporal context information. To
address this issue, we propose a new context-aware descriptor
that incorporates context information from neighboring interest
points. This way, the new descriptor is more discriminative and
robust than the traditional BoVW.
2. Building an action unit set to represent all action classes
under investigation. Nonnegative Matrix Factorization
(NMF) [43] has received considerable attention and has
been shown to capture part-based representation in the
human brain [40], [41] as well as in vision tasks [42], [43].
We propose using graph regularized Nonnegative Matrix
Factorization (GNMF) [44] to encode the geometrical
information by constructing a nearest neighbor graph. It finds
a part-based representation in which two data points are
connected if they are sufficiently close to each other. The
GNMF-based action units are automatically learned from
the training data and are capable of capturing the intra-class
variation of each action class.
3. Choosing discriminative action units and suppressing
noise in action classes. We propose a new action unit selection
method based on a joint l2,1-norm minimization. We first
introduce the l2,1-norm for vectors. A sparse model based on such a norm is robust to outliers, and the regularization can guide the selection of action units across intra-class samples. The
dictionary learning process captures the fact that actions from
the same class share similar action units.
In this work we target learning high-level action units to
represent and classify human actions. For this purpose, we
improve over the traditional interest point feature and propose
an action unit based solution, which is further improved by an
action unit selection procedure. Fig. 2 illustrates the flowchart
of our framework. In summary, the training phase learns the
model for action units and the action classifier on the action
unit-based representation. The testing phase uses the learned
model for action prediction.
In the rest of the paper, Sec. 2 reviews the related work.
Sec. 3 introduces the new context-aware descriptor as the low-
level feature. Sec. 4 presents the GNMF-based action unit
as the high-level feature. Sec. 5 proposes the joint l2,1-norm
based action unit selection and a supervised dictionary learning
method for classification. Sec. 6 demonstrates the experimental
results. Sec. 7 concludes this paper.
II. RELATED WORK
Action recognition has been widely explored in the com-
puter vision community. Recently, some attempts have been
made to use the mid- or high-level concepts for human
action recognition. Liu et al. [34] exploit mutual informa-
tion maximization techniques to learn a compact mid-level
codebook and use the spatial-temporal pyramid to exploit
temporal information. Fathi et al. [48] extract discriminative
flow features within overlapping space-time cells and select
mid-level features via AdaBoost. Unfortunately, the global
binning makes the representation sensitive to position or
time shifts in the clip segmentation, and using predetermined
fixed-size spatial-temporal grid bins assumes that the proper
volume scale is known and uniform across action classes. Such
uniformity is not inherent in the features themselves, given
the large differences between the spatial-temporal distributions
of the features for different activities. Wang et al. [45] use
the hidden conditional random fields for action recognition.
The authors model an action class as a root template and a
constellation of hidden “parts”, where the hidden “part” is
a group of local patches that are implicitly correlated with
some intermediate representation. Ramanan et al. [46] first
track the persons with 3D pose estimation, which is then
used for action recognition. Liu et al. [47] use diffusion
maps to automatically learn a semantic visual vocabulary from
abundant quantized mid-level features, each represented by the
vector of pointwise mutual information. But the vocabularies
are created for individual categories, thus they are not univer-
sal and general enough, which limits their applications. Liu
et al. [6] learn data-driven attributes as the latent variables.
The authors augment the manually-specified attributes with the
automatically learned attributes to provide a complete charac-
terization of human actions. Compared with traditional low-
level features [13], [55]–[57], it is obvious that human actions
are more effectively represented by considering multiple high-
level semantic concepts. However, the current learning-based
visual representations obtain labels from the entire video,
and hence include background clutter, which may degrade learning effectiveness.
High-level concepts have also been applied to object recog-
nition. Farhadi et al. [39] use a set of semantic attributes such
as ‘hairy’ and ‘four-legged’ to identify familiar objects, and
to describe unfamiliar objects when images and bounding box
annotations are provided. Lampert et al. [50] show that high-
level descriptors in terms of semantic attributes can be used
to recognize object classes without any training image, once
semantic attribute classifiers are trained from other classes of
data. Vogel et al. [51] use attributes related to the scenes to
characterize image regions and combine these local semantic
attributes to form a global image description for natural scene
retrieval. Wang et al. [52] propose to represent an image by
its similarities to Flickr image groups which have explicit
semantic meanings. Classifiers are learned to predict the
membership of images to Flickr groups, and the class mem-
bership probabilities are used to define the image similarity.
Li et al. [53] build a semantically meaningful image hierarchy
by using both visual and semantic information, and represent
images by the estimated distributions of concepts over the
entire hierarchy. Torresani et al. [54] use the outputs of a large
number of object category classifiers to represent images.
Dictionary learning has been proven to be effective to
select discriminative low-level features for classification.
Liu et al. [3] use PageRank to mine the most informa-
tive static features. In order to further construct compact
yet discriminative visual vocabularies, a divisive information-
theoretic algorithm is employed to group semantically related
features. Brendel et al. [63] store multiple diverse exemplars
per activity class, and learn a sparse dictionary of most
discriminative exemplars. Qiu et al. [68] propose a Gaussian
process model to optimize the dictionary objective function.
The dictionary learning algorithm is based on sparse rep-
resentation which has recently received a lot of attention.
Mairal et al. [69] generalize the reconstructive sparse dictio-
nary learning process by optimizing the sparse reconstruction
jointly with a linear prediction model. Bradley and Bagnell
[70] propose a novel differentiable sparse prior rather than
the conventional L1 norm, and employ a back propagation
procedure to train the dictionary for sparse coding in order
to minimize the training error. These approaches need to
explicitly associate each sample with a label in order to
perform the supervised training. They aim at learning dis-
criminative sparse models instead of purely reconstructive
ones.
III. LOCALLY WEIGHTED WORD CONTEXT DESCRIPTOR
We propose a new context-aware descriptor called locally
weighted word context (LWWC) as the low-level descriptor.
LWWC encodes spatial context information rather than being
limited to a single interest point as used in traditional interest-
point-based descriptors. Such spatial context information is
extracted from neighboring interest points, and can be
used to improve the robustness and discriminability of the
proposed descriptor.
We first perform space-time interest point detection using the methods in [12] and [13] for the different data sets. These
interest points are initially described by histograms-of-optical-
flow (HOF) and histograms-of-oriented-gradients (HOG),
which respectively characterize the motion and appearance
within a volume surrounding an interest point. Afterwards,
we employ the k-means algorithm on these features to create
a vocabulary of size K. Following the BoVW, each interest
point is then converted to a visual word.
Fig. 3. Illustration of the locally weighted word context descriptor. (a) The structure of the descriptor: an LWWC is constructed from several neighboring interest points together, where Dσ(p, qj) is the distance between the central interest point and a neighboring interest point. (b) The representation of the central interest point by a K-by-1 vector applied to each local feature.
Finally, for each interest point together with its N − 1 nearest interest points, the LWWC descriptor is calculated as follows.
Let N(p) = {p, q1, · · · , qN−1} denote an interest point p
and its N − 1 nearest neighboring points. The N − 1 nearest
neighboring points are collected according to the normal-
ized Euclidean distance on their 3D position spatial-temporal
coordinates:
$$D_\sigma(p, q) = \left[ \sum_{i=1,2,3} \frac{1}{\sigma_i} \big( p(i) - q(i) \big)^2 \right]^{1/2}, \quad (1)$$
where the three components in p and q record the horizontal,
vertical and temporal positions of the interest points respec-
tively; and σi is the corresponding weight.
Using the BoVW model, each point in N(p) is assigned
to a visual word. Therefore, the LWWC of N(p) is defined
as a vector of size K × 1 where K is the size of the
vocabulary. The k-th element of the LWWC vector is inversely proportional to the distance between p and the corresponding points in N(p). If multiple points in N(p) belong to the same visual word, their responses are summed together. Specifically, we set the value corresponding to p to 1, and the value corresponding to qj ∈ N(p) to β/Dσ(p, qj). As shown in Fig. 3, the LWWC for N(p) is denoted as follows:

$$F_p = [h_1, h_2, \cdots, h_K]^T, \quad (2)$$

$$h_i = \begin{cases} 1 & \text{if } \operatorname{label}(p) = i \\ \sum_{j=1}^{N-1} \beta \cdot \dfrac{\delta\big(\operatorname{label}(q_j) - i\big)}{D_\sigma(p, q_j)} & \text{otherwise,} \end{cases} \quad (3)$$
where label(p) denotes the word id of point p. By rebuilding the BoVW model on the LWWC descriptors, each action video is represented by a vector of low-level features.
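To make the construction concrete, a minimal NumPy sketch of Eqs. (1)–(3) is given below; the array layout, the default parameters (N = 8, β = 1), and the function name are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def lwwc_descriptors(positions, labels, K, N=8, sigma=(1.0, 1.0, 1.0), beta=1.0):
    """Sketch of Eqs. (1)-(3): one LWWC vector per interest point.

    positions: (M, 3) array of (horizontal, vertical, temporal) coordinates.
    labels:    (M,) visual-word ids in [0, K) from the HOG/HOF vocabulary.
    Returns an (M, K) array whose i-th row is F_p for the i-th point.
    """
    positions = np.asarray(positions, dtype=float)
    labels = np.asarray(labels)
    inv_sigma = 1.0 / np.asarray(sigma, dtype=float)
    F = np.zeros((len(labels), K))
    for idx, p in enumerate(positions):
        # Weighted Euclidean distance of Eq. (1) to every other point.
        d = np.sqrt((((positions - p) ** 2) * inv_sigma).sum(axis=1))
        d[idx] = np.inf                        # exclude the point itself
        neighbors = np.argsort(d)[: N - 1]     # its N-1 nearest neighbors
        h = np.zeros(K)
        for j in neighbors:                    # each neighbor adds beta / D_sigma
            h[labels[j]] += beta / d[j]
        h[labels[idx]] = 1.0                   # bin of p's own word is set to 1
        F[idx] = h
    return F
```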
IV. GNMF-BASED ACTION UNITS
NMF is a factorization algorithm for analyzing nonnegative
matrices. The nonnegative constraints allow only additive combinations of different bases.
Fig. 4. GNMF-based action units. (a) A video from class “walking” includes two action units, “translation of torso” and “leg motion”. (b) A video from class “waving” includes two identical action units, “arm motion”.
This is the most signifi-
cant difference between NMF and other matrix factorization
methods such as singular value decomposition (SVD). NMF can learn a part-based representation, but it fails to discover the intrinsic geometrical and discriminative structure of the data space, which is essential for real-world applications. If two data points are close in the intrinsic geometry of the data distribution, their representations with respect to the new bases should still be close to each other.
GNMF [44] aims to solve this problem. Most previous works
represent actions with low-level features. In this paper, we
propose to extract high-level action units based on GNMF to
better describe the human actions. The GNMF-based action
units are generated as follows.
Let $y_i^j \in \mathbb{R}^d$, $i = 1, \cdots, C$, $j = 1, \cdots, n_i$, denote the d-dimensional low-level feature representation of the j-th video in class i. All such representations in class i form a matrix $Y_i = [y_i^1, \cdots, y_i^{n_i}] \in \mathbb{R}^{d \times n_i}$. GNMF minimizes the following objective function:

$$Q = \left\| Y_i - U V^T \right\|_F^2 + \lambda \, \mathrm{Trace}\left( V^T L V \right), \quad (4)$$
where $U \in \mathbb{R}^{d \times s_i}$ and $V \in \mathbb{R}^{n_i \times s_i}$ are two nonnegative matrices, $L = D - W$ is called the graph Laplacian [62], and $W$ is the symmetric nonnegative similarity matrix. We adopt the heat kernel weight $W_{jl} = e^{-\frac{1}{\delta} \| y_i^j - y_i^l \|^2}$. $D$ is a diagonal matrix whose entries are the column (or row, since $W$ is symmetric) sums of $W$. Each column of the matrix $V^T$ is a low-dimensional representation of the corresponding column of $Y_i$ with respect to the new bases. We define the column vectors of matrix $U$
to the new bases. We define the column vectors of matrix U
as the action units belonging to action class i. Each element
of the column vector corresponds to a visual word which is
obtained from the low-level descriptors. Each column vector
of matrix U is a semantic representation constructed by several
correlated visual words. Repeating the same process for each
action class, we obtain all the class-specific action units.
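For concreteness, the following is a minimal sketch of learning the units of one class with GNMF, assuming a heat-kernel nearest-neighbor graph and the standard multiplicative update rules of [44]; the parameter values and random initialization are illustrative only.

```python
import numpy as np

def gnmf_action_units(Y, k, n_neighbors=5, lam=1.0, delta=1.0, n_iter=200, eps=1e-9):
    """Sketch of GNMF (Eq. (4)) for one action class.

    Y: (d, n_i) nonnegative matrix of low-level video representations.
    Returns U (d, k), whose columns are the action units, and V (n_i, k).
    """
    d, n = Y.shape
    rng = np.random.default_rng(0)
    U = rng.random((d, k))
    V = rng.random((n, k))

    # Heat-kernel weights on a nearest-neighbor graph.
    sq = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)   # pairwise squared distances
    W = np.exp(-sq / delta)
    np.fill_diagonal(W, 0.0)
    for j in range(n):                          # keep only the strongest neighbors
        W[j, np.argsort(W[j])[: n - n_neighbors]] = 0.0
    W = np.maximum(W, W.T)                      # symmetrize
    D = np.diag(W.sum(axis=1))

    # Multiplicative updates of GNMF [44].
    for _ in range(n_iter):
        U *= (Y @ V) / (U @ V.T @ V + eps)
        V *= (Y.T @ U + lam * W @ V) / (V @ U.T @ U + lam * D @ V + eps)
    return U, V
```

The columns of the returned U play the role of the class-specific action units, and the rows of V give the action-unit-based representation of each training video.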
As an example, suppose we have four action units forming
the bases: “translation of torso”, “up-down torso motion”,
“arm motion”, and “leg motion”. Then the action class
“walking” may be represented by an action-unit-based vec-
tor [1, 0, 1, 1] ∈ R4, and “waving” may be represented
by vector [0, 0, 2, 0] ∈ R4, with each dimension indicat-
ing the degree of the corresponding action unit as shown
in Fig. 4.
The action-unit-based representation has two main advan-
tages. First, it is compact, since only tens of action units are needed to describe an action video. This is more efficient than BoVW models, where hundreds of interest points are needed.
Second, some low-level local features are not discriminative,
and even have a negative influence on classification. The process of learning class-specific action units can suppress such noise.
The matrix factorization algorithm extracts the representative
action units for each action class. The representative action
units should exist in all the videos belonging to the same
action class. Some low-level local features that only exist
in a few intra-class videos are suppressed by the algorithm
mentioned above, and are not used for constructing the high-
level action units. The learned class-specific action units
can exhibit the characteristic of each action class. So, the
proposed action-unit-based representation is more powerful for
classification.
V. ROBUST ACTION UNIT SELECTION BASED ON
JOINT l2,1-NORMS
Recently, sparse representation has received a lot of attention in computer vision. Typically, it approximates the input signal by a sparse linear combination of given overcomplete bases in a dictionary. Such sparse representations are usually derived by linear programming as an l1-norm minimization problem. But the l1-norm based regularization
is sensitive to outliers. Inspired by the l2,1-norm of a matrix
[58], we first introduce the l2,1-norm of a vector. Moreover, we
propose a new joint l2,1-norm based sparse model to select the
representative action units for each action class. The proposed
sparse model mainly has two advantages for classification-
based action unit selection. First, the l2,1-norm of the matrix in our sparse model encourages the samples from the same action class to be constructed from similar action units, and action units that appear in only a few intra-class samples are suppressed. Second, each action class has its own representative action units. The l2,1-norm of the vector in our sparse model encourages each sample to be constructed from action units of the same class.
A. Notations and Definitions
We first introduce the notations and the definition of norms.
For a matrix $Z$, its j-th row and k-th column are denoted by $Z_{j\cdot}$ and $Z_{\cdot k}$, respectively, and $z_{jk}$ is the element in the j-th row and k-th column. The l2,1-norm of a matrix is introduced in [58]
as a rotational invariant l1-norm and also used for multi-task
learning [59], [60] and tensor factorization [61]. It is defined
as:
$$\|Z\|_{2,1} = \sum_{j=1}^{n} \|Z_{j\cdot}\|_2 = \sum_{j=1}^{n} \left( \sum_{k=1}^{r} z_{jk}^2 \right)^{\frac{1}{2}}. \quad (5)$$
We now introduce the l2,1-norm of a vector. For a vector $b = [b_1, b_2, \cdots, b_n]^T$, the elements are divided into $G$ groups according to some rule, and the number of elements in group $g$ is $m_g$, i.e., $b = [b_{11}, \cdots, b_{1m_1}, \cdots, b_{g1}, \cdots, b_{gm_g}, \cdots, b_{G1}, \cdots, b_{Gm_G}]^T$. The l2,1-norm of the vector $b$ is defined as:

$$\|b\|_{2,1} = \sum_{g=1}^{G} \left( \sum_{k=1}^{m_g} b_{gk}^2 \right)^{\frac{1}{2}}. \quad (6)$$
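As a small worked example of Eqs. (5) and (6) (the helper names are ours):

```python
import numpy as np

def matrix_l21_norm(Z):
    """||Z||_{2,1} of Eq. (5): the sum of the l2-norms of the rows of Z."""
    return np.linalg.norm(Z, axis=1).sum()

def vector_l21_norm(b, group_sizes):
    """||b||_{2,1} of Eq. (6): split b into consecutive groups of the given
    sizes and sum the group-wise l2-norms."""
    b = np.asarray(b, dtype=float)
    bounds = np.cumsum(group_sizes)[:-1]
    return sum(np.linalg.norm(g) for g in np.split(b, bounds))

# With groups of sizes (2, 2): ||[3, 4, 0, 5]||_{2,1} = 5 + 5 = 10.
print(vector_l21_norm([3, 4, 0, 5], (2, 2)))   # -> 10.0
```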
B. Problem Formulation
Following the notation in Sec. IV, assume that the i-th action class has $m_i$ learned action units, and $\sum_{i=1}^{C} m_i = m$. We initialize the dictionary $B = [B_1, B_2, \ldots, B_C]$ such that $B_i = [b_{i1}, \ldots, b_{im_i}]$, where $b_{ij}$ denotes the j-th action unit of the i-th class. The proposed sparse model for action unit selection is
$$\min_{B, X^i} \sum_{i=1}^{C} \left( \left\| Y_i - B X^i \right\|_F^2 + \gamma_1 \sum_{k} \left\| X^i_{\cdot k} \right\|_{2,1} + \gamma_2 \left\| X^i \right\|_{2,1} \right), \quad (7)$$
where $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|_{2,1}$ the l2,1-norm. The term $\|X^i_{\cdot k}\|_{2,1}$ penalizes each group of elements in the vector $X^i_{\cdot k}$ (each group corresponding to the columns of one sub-dictionary $B_i$) as a whole and enforces sparsity among the groups, so it encourages each action video to be constructed from action units of the same class. The term $\|X^i\|_{2,1}$ penalizes each row of the matrix $X^i$ as a whole and enforces sparsity among the rows, so it encourages videos from the same action class to be constructed from similar action units. $\gamma_1$ and $\gamma_2$ are the regularization parameters.
For classification, with a sparse model $\phi(y_t, B)$, a predictive model $f(x_t) = f(\phi(y_t, B))$, a class label $l_t$ of the action video $y_t$, and a classification loss $L(l_t, f(x_t)) = \|l_t - f(x_t)\|_2^2$, we desire to train the whole system with respect to the dictionary $B$ given $P$ training samples:

$$\min_B E = \min_B \sum_{t=1}^{P} L\big(l_t, f(\phi(y_t, B))\big). \quad (8)$$
The dictionary optimization is carried out using an iterative
approach composed of two steps: the sparse coding step on a
fixed B and the dictionary update step on a fixed $X^i$.
Step 1: Sparse coding. Taking the derivative of (7) with respect to $X^i_{\cdot k}$ ($1 \le k \le n_i$) and setting it to zero, we have

$$2 B^T B X^i_{\cdot k} - 2 B^T (Y_i)_{\cdot k} + \gamma_1 D_k X^i_{\cdot k} + \gamma_2 E X^i_{\cdot k} = 0, \quad (9)$$
where

$$E = \mathrm{diag}\left( \frac{1}{\|X^i_{1\cdot}\|_2}, \frac{1}{\|X^i_{2\cdot}\|_2}, \cdots, \frac{1}{\|X^i_{m\cdot}\|_2} \right), \quad (10)$$

$$D_k = \mathrm{diag}\big( w_1 I(m_1), w_2 I(m_2), \cdots, w_C I(m_C) \big), \quad (11)$$

where $w_j = \Big( \sum_{p=1+M_j-m_j}^{M_j} (X^i_{p,k})^2 \Big)^{-1/2}$, $M_j = \sum_{l=1}^{j} m_l$, $I(m_j)$ is the $m_j \times m_j$ identity matrix, and $\mathrm{diag}(\cdot)$ denotes a diagonal matrix formed from the elements of the vector.
From Eq. (9), we get

$$X^i_{\cdot k} = 2 \left( 2 B^T B + \gamma_1 D_k + \gamma_2 E \right)^{-1} B^T (Y_i)_{\cdot k}. \quad (12)$$

Note that $D_k$ and $E$ depend on $X^i$ and thus are also unknown variables. An iterative algorithm is proposed to solve this problem, which is illustrated in Algorithm 1.
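The content of the Algorithm 1 box is not reproduced here; the following is a minimal sketch of the iteration it describes, alternating between recomputing $D_k$ and $E$ from the current coefficients and re-solving Eq. (12), with an assumed all-ones initialization and a fixed iteration count in place of a convergence test.

```python
import numpy as np

def sparse_code_class(Y_i, B, group_sizes, gamma1=0.2, gamma2=0.2,
                      n_iter=50, eps=1e-8):
    """Sketch of Algorithm 1: iterate Eq. (12) with D_k and E refreshed.

    Y_i: (d, n_i) videos of one class;  B: (d, m) dictionary;
    group_sizes: (m_1, ..., m_C) action units per class, summing to m.
    Returns X (m, n_i), the action-unit coefficients of the class.
    """
    m, n_i = B.shape[1], Y_i.shape[1]
    bounds = np.cumsum(group_sizes)            # M_1, ..., M_C of Eq. (11)
    X = np.ones((m, n_i))                      # assumed initialization
    BtB, BtY = B.T @ B, B.T @ Y_i
    for _ in range(n_iter):
        # E of Eq. (10): reciprocal l2-norms of the rows of X.
        E = np.diag(1.0 / (np.linalg.norm(X, axis=1) + eps))
        X_new = np.empty_like(X)
        for k in range(n_i):
            # D_k of Eq. (11): one weight w_j per class-specific group.
            w, start = np.empty(m), 0
            for M_j in bounds:
                w[start:M_j] = 1.0 / (np.linalg.norm(X[start:M_j, k]) + eps)
                start = M_j
            # Closed-form column update of Eq. (12).
            X_new[:, k] = 2.0 * np.linalg.solve(
                2.0 * BtB + gamma1 * np.diag(w) + gamma2 * E, BtY[:, k])
        X = X_new
    return X
```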
Algorithm 1: An iterative algorithm for sparse coding.
Step 2: Dictionary updating. Minimizing the loss function E over B will tighten the learned dictionary with the classification model, and therefore improve the classification
effectiveness. We compute the gradient of E with respect to B
according to Eq. (8):

$$\nabla_B E = \sum_{t=1}^{P} \nabla_B L = \sum_{t=1}^{P} \nabla_f L \cdot \nabla_B f = \sum_{t=1}^{P} \nabla_f L \cdot \nabla_{x_t} f \cdot \nabla_B x_t. \quad (13)$$
Therefore, the problem is reduced to computing the gradient of the sparse representation $x_t$ with respect to the dictionary $B$. According to Eq. (12), we have

$$\nabla_B X^i_{\cdot k} = \left( 2 B^T B + \gamma_1 D_k + \gamma_2 E \right)^{-1} \cdot \Big[ 2 \nabla_B \big( B^T (Y_i)_{\cdot k} \big) - \big( 2 \nabla_B (B^T B) + \gamma_1 \nabla_B D_k + \gamma_2 \nabla_B E \big) X^i_{\cdot k} \Big]. \quad (14)$$
Through the above process, we obtain the optimized
dictionary.
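The analytic gradient of Eqs. (13) and (14) involves third-order tensors; purely to illustrate the alternating scheme, the sketch below replaces it with a finite-difference approximation of the loss E. The linear predictor, label encoding, and step size are assumptions for illustration, not the paper's choices.

```python
import numpy as np

def dictionary_step(B, videos, labels, W_clf, code_fn, lr=1e-3, h=1e-4):
    """One illustrative update of B for the loss of Eq. (8), with a
    finite-difference gradient standing in for Eqs. (13)-(14).

    code_fn(y, B) returns the sparse code x of a video y (e.g., via Eq. (15)),
    W_clf is an assumed linear predictor f(x) = W_clf @ x, and labels are
    one-hot class vectors l_t.
    """
    def loss(Bc):
        return sum(np.sum((l - W_clf @ code_fn(y, Bc)) ** 2)
                   for y, l in zip(videos, labels))

    grad = np.zeros_like(B)
    for idx in np.ndindex(*B.shape):           # slow but simple numerical gradient
        Bp, Bm = B.copy(), B.copy()
        Bp[idx] += h
        Bm[idx] -= h
        grad[idx] = (loss(Bp) - loss(Bm)) / (2.0 * h)
    return B - lr * grad
```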
For a test sample $y$, we obtain the action-unit-based representation $x$ by

$$\min_x \|y - Bx\|_F^2 + \gamma_1 \|x\|_{2,1}. \quad (15)$$

SVM is adopted as the predictive model for discriminative dictionary learning and classification, and we employ the generalized Gaussian kernel with the $\chi^2$ distance, i.e.,

$$K(H_i, H_j) = \exp\left( -\frac{1}{A} \chi^2(H_i, H_j) \right), \quad (16)$$

for two histograms $H_i$ and $H_j$, where $A$ is the scale parameter, set to the mean distance between training samples.
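A small sketch of the kernel in Eq. (16) follows; the 1/2 factor in the χ2 distance is a common convention assumed here, and the paper specifies only that A is the mean distance over training samples.

```python
import numpy as np

def chi2_distances(H1, H2, eps=1e-10):
    """Pairwise chi-square distances between histogram rows
    (the common 1/2-factor convention is assumed)."""
    D = np.zeros((len(H1), len(H2)))
    for a, h1 in enumerate(H1):
        for b, h2 in enumerate(H2):
            D[a, b] = 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
    return D

def chi2_kernel(H1, H2, A=None):
    """Generalized Gaussian kernel of Eq. (16); A defaults to the mean
    distance over the given pairs (computed on training data in the paper)."""
    D = chi2_distances(np.asarray(H1, float), np.asarray(H2, float))
    return np.exp(-D / (D.mean() if A is None else A))

# Usage with a precomputed-kernel SVM, e.g. sklearn.svm.SVC(kernel="precomputed"):
#   A = chi2_distances(H_train, H_train).mean()
#   K_train = chi2_kernel(H_train, H_train, A)
#   K_test = chi2_kernel(H_test, H_train, A)
```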
VI. EXPERIMENTAL RESULTS
A. Data Sets
Five action data sets are used in our evaluation: the
KTH action data set [31], the UCF Sports data set [10], the
UT-Interaction data set [21], the UCF YouTube data set [3],
and the Hollywood2 data set [73]. Examples of these data sets
are shown in Fig. 5.
The KTH data set contains six single person actions
(boxing, hand waving, hand clapping, walking, jogging, and
running) performed repeatedly by 25 persons in four different
scenarios: outdoors, outdoors with camera zoom, outdoors
with different clothes, and indoors.
The UCF Sports data set is a challenging collection of action clips from various broadcast sports videos. Actions include diving, golf swinging, kicking, lifting, horseback riding, running, skating, swinging, and walking. The actions are captured in a wide range of scenes and viewpoints.
Fig. 5. Representative frames from videos in five data sets. From top to
bottom: the KTH data set, the UCF Sports data set, the UT-Interaction data
set, the UCF YouTube data set, and the Hollywood2 data set.
The UT-Interaction data set has been used in the first
Contest on Semantic Description of Human Activities [21].
This data set contains action sequences of six interactions:
hug, kick, point, punch, push, and hand-shake. For classifica-
tion, 120 video segments cropped based on the ground-truth
bounding boxes and time intervals are provided by the data set
organizers. These segments are further divided into two sets,
and each set has 60 segments with 10 segments per class. Set 1
is captured at a parking lot and set 2 at a lawn.
The UCF YouTube data set contains 11 action categories:
basketball shooting, biking/cycling, diving, golf swinging,
horse back riding, soccer juggling, swinging, tennis swinging,
trampoline jumping, volleyball spiking, and walking with a
dog. This data set is challenging due to large variations in
camera motion, object appearance and pose, object scale,
viewpoint, cluttered background and illumination conditions.
The Hollywood2 data set has been collected from 69 differ-
ent Hollywood movies. There are 12 action classes: answering the phone, driving a car, eating, fighting, getting out of a car, hand shaking, hugging, kissing, running, sitting down, sitting up,
and standing up. In our experiments, we use the clean training
data set. In total, there are 1707 action samples divided into a
training set and a test set. Train and test sequences come from
different movies.
For the KTH and UT-Interaction data sets, the Harris3D
detector [13] is used for interest point extraction, and the
cuboid detector [12] is adopted for the other data sets.
B. Effects of the LWWC Descriptor
In our approach, we adopt the LWWC descriptor at low
level. Different from traditional interest-point-based methods,
the proposed descriptor contains the information of an area,
rather than a single interest point, by describing the distribution
of neighboring interest points.
Experiments are conducted to evaluate the influence of the
neighborhood information in the LWWC descriptor. Fig. 6
illustrates the recognition rates corresponding to different
scales of neighborhood information covered by the proposed
descriptor on the KTH and the challenging UT-Interaction
data sets. Traditional interest-point-based methods utilize only the features of a single interest point, which describe a very small area, so the accuracy is easily influenced by noise. The recognition rate is 93.99% on the KTH data set.
Fig. 6. Performance of LWWC on the KTH and UT-Interaction data sets.
Fig. 7. Performance of action unit selection on KTH and UT-Interaction.
When the features of neighboring interest points are involved, the
proposed descriptor describes a larger area, and makes use
of more neighborhood information. So, the recognition rate is
raised to 95.49% when 8 nearest neighboring interest points
are collected for the LWWC descriptor. To further validate
the effectiveness of our descriptor, we also conduct similar
experiments on the UT-Interaction data set. In set 1, when we
adopt the feature of a single interest point, the accuracy is
only 78.3%. The accuracy is raised to 81.7% when 4 nearest
neighboring interest points are collected. In set 2, the accuracy
is raised to 80.0% from 68.3% when 8 nearest neighboring
interest points are collected. These experimental results show
that the recognition rate is improved when the neighborhood
information is incorporated into the LWWC descriptor.
C. Analysis of the Action Unit Selection
Based on low-level descriptors, the action units are learned
through GNMF. Among the learned action units, our proposed
joint l2,1-norm based sparse model aims at selecting the class-
specific representative action units to improve the recognition
performance. It encourages actions from the same class to be described by the same action units, and each action to be described by action units from the same class.
To evaluate action unit selection, we compare the perfor-
mances of the original GNMF-based action units with the one
using action unit selection, as illustrated in Fig. 7. On the
KTH data set, the action unit selection significantly boosts the
performance from 92.65% to 95.49%. On the UT-Interaction
data set, it again significantly boosts the recognition accuracies
from 80.0% to 81.7% (on Set 1) and from 70.0% to 80.0%
(on Set 2). These results clearly validate that the proposed
joint l2,1-norm based action unit selection method is effective
to improve the recognition performance.
Fig. 8. Confusion matrix of the classification on the KTH data set.
TABLE I
COMPARISON WITH PREVIOUS METHODS ON THE
KTH DATA SET
D. Experiments on the KTH Data Set
Consistent with the experiment setting used in
[13], [31], [34], we test the proposed approach on the
entire KTH data set [31], in which videos of four scenarios
are mixed together. We split the data set into a training part
with 24 persons’ videos and a test part with the remaining
videos. The final result is the average over 25 runs.
For the sparse model based action unit selection, we set the
tradeoff parameters γ1 = γ2 = 0.2.
Fig. 8 presents the confusion matrix of the classification on
the KTH data set. It shows that our approach works excellently
on most actions such as “hand waving” and “boxing”. The
main confusion occurs between “jogging” and “running”,
since the actions performed by some actors are very similar.
Table I lists the average accuracies of our method and other
recently proposed ones. It shows that our method achieves
excellent performance (95.49%), which is comparable to the
best reported results.
The experimental results validate the effectiveness of the
proposed method. Furthermore, we compare the performances
of different baseline approaches (such as traditional single
interest point feature, the proposed LWWC descriptor, l1-norm
based sparse model, and our action unit selection approach),
and study the contribution of each part in our method, as
illustrated in Table II.
TABLE II
CONTRIBUTIONS OF THE PROPOSED DIFFERENT APPROACHES TO CLASSIFICATION ON THE KTH DATA SET
TABLE III
CLASSIFICATION ACCURACIES ON THE KTH DATA SET
The accuracy of the traditional single interest point feature is 91.80%. If we only use the LWWC
descriptor, the accuracy is 92.65%. When we only adopt the
action unit selection based on the traditional single interest
point feature, the accuracy is 93.99%. Combining both, the
accuracy reaches 95.49%. When we use the l1-norm based sparse model on the LWWC descriptors, the accuracy is 92.82%. The study
demonstrates that each of the proposed approaches offers
more discriminative power than the BoVW baseline, and the
l2,1-norm based action unit selection approach obtains better
performance than the l1-norm based sparse model. It further
validates the effectiveness of the high-level descriptor for
classification. Our method, which combines the low-level
LWWC descriptor with the high-level action unit selection,
achieves the best performance. Furthermore, our method is
compared with some basic cases, as shown in Table III. The
comparisons validate that the proposed method improves the
performance of traditional methods.
E. Experiments on the UCF Sports Data Set
Most previously reported results on this data set use a leave-one-out (LOO) protocol, cycling through each example as the test video one at a time. But Lan et al. [76] propose to split the
data set into disjoint training and testing sets to avoid the
background regularity for evaluation. We report our results in
both protocols. For the realistic data set, we perform dense and
multi-scale interest point extraction. To generate the codebook,
we empirically set the codebook size k to 1000. For the
sparse model based action unit selection, we set the tradeoff
parameters γ1 = γ2 = 0.2.
Fig. 9 presents the confusion matrix across all scenarios
in the leave-one-out protocol on the UCF Sports data set.
Our method works well on most actions. For example, the
recognition accuracies for some actions are as high as 100%,
such as “diving” and “lifting”. There are complex backgrounds
in this data set, and some actions are very similar and
challenging for recognition, such as “golfing”, “horseback
riding”, and “running”. We conduct further experiments on
the UCF Sports data set to study different components of the proposed approach, similar to the study on the KTH data set.
Fig. 9. Confusion matrix on the UCF Sports data set (LOO).
TABLE IV
CONTRIBUTIONS OF THE PROPOSED DIFFERENT APPROACHES TO CLASSIFICATION ON THE UCF SPORTS DATA SET (SPLIT)
TABLE V
COMPARISON WITH PREVIOUS METHODS ON THE UCF SPORTS DATA SET
Table IV
illustrates the comparison of the two proposed approaches
with their combination in Lan’s manner (Split). The effectiveness of each proposed ingredient is again confirmed, as in the results on the KTH data set. Table V compares the overall
mean accuracy of our method with the results reported by
previous researchers. Our average recognition accuracy is
competitive with most reported results except the action bank
method [74].
F. Experiments on the UT-Interaction Data Set
The action videos in UT-Interaction are divided into two
sets. To generate the codebook, we empirically set the code-
book size k to 500 in set 1, and the codebook size k to 300 in set 2.
Fig. 10. Confusion matrices on the UT-Interaction data set. (a) Set 1. (b) Set 2.
TABLE VI
CONTRIBUTIONS OF THE PROPOSED DIFFERENT APPROACHES TO CLASSIFICATION ON THE UT-INTERACTION DATA SET
TABLE VII
COMPARISON WITH PREVIOUS METHODS ON THE UT-INTERACTION DATA SET. THE THIRD AND FOURTH COLUMNS REPORT THE RECOGNITION RATE USING THE FIRST HALF AND THE ENTIRE VIDEOS, RESPECTIVELY
We set the tradeoff parameters γ1 =
γ2 = 0.2 for both sets. We perform the leave-one-out test
strategy.
Fig. 10 presents the confusion matrices across all scenarios
in set 1 and set 2. The recognition accuracies for some actions
are excellent, such as “point” in set 2 and “push”. Table VI
illustrates the comparison of the two proposed approaches with
their combination in both set 1 and set 2. Each approach offers
more discriminative power than traditional single interest
point feature. The LWWC descriptor performs better than the
action unit selection method in both sets, and the combina-
tion of them provides the best performance. The l2,1-norm
based action selection method outperforms the l1-norm based
sparse model. Also, the result of our method is compared
with those reported by previous researchers in Table VII,
and some basic cases in Table VIII. The proposed method outperforms most other methods and achieves a competitive result.
TABLE VIII
CLASSIFICATION ACCURACIES OF DIFFERENT METHODS ON THE UT-INTERACTION DATA SET (SET 1 AND SET 2)
Fig. 11. Confusion matrix on the UCF YouTube data set.
Fig. 12. Performance comparisons among related interest-point-based approaches on the UCF YouTube data set.
G. Experiments on the UCF YouTube Data Set
We follow the original setup [3] using leave one out cross
validation for a pre-defined set of 25 folds, and perform dense
interest point extraction. The codebook size is set to 1000, and
the tradeoff parameters γ1 = γ2 = 0.3.
Fig. 11 presents the confusion matrix across all scenarios on the UCF YouTube data set. In Fig. 12, we compare the per-class performance of related methods, including cuboid, LWWC, action unit selection, and some previously reported interest-point-based methods; our method outperforms the others.
TABLE IX
ACCURACIES ON THE UCF YOUTUBE DATA SET
Fig. 13. Performance of different versions of the proposed approach on the Hollywood2 data set.
TABLE X
COMPARISON WITH PREVIOUS METHODS ON THE HOLLYWOOD2 DATA SET
Table IX compares the overall mean accuracy of our
method with those reported by previous researchers. Our average
recognition accuracy is 82.2%, which is comparable to the
state-of-the-art performance and outperforms other interest-
point-based methods.
H. Experiments on the Hollywood2 Data Set
Similar to the parameters used in the YouTube data set, the
codebook size k is empirically set to 1000, and we set the
tradeoff parameters γ1 = γ2 = 0.3.
Fig. 13 presents the performance of our method and the contribution of each component to the recognition accuracy. The accuracy of the traditional
single interest point feature is 47.9%. If we only utilize the
LWWC descriptor, the accuracy is 50.1%. When we only
adopt the action unit selection based on the traditional single
interest point feature, the accuracy is 54.5%. In combination,
the accuracy of our method is 56.8%. Table X compares the
overall mean accuracy of our method with the results reported
by previous researchers. Our average recognition accuracy is
better than or comparable to the state-of-the-art performances.
VII. CONCLUSION
In this paper, we have proposed to represent human actions
by a set of intermediate concepts called action units which
are automatically learned from the training data. At low level,
we have presented a locally weighted word context descriptor
to improve the traditional interest-point-based representation.
The proposed descriptor incorporates the neighborhood infor-
mation effectively. At high level, we have introduced the
GNMF-based action units to bridge the semantic gap in
action representation. Furthermore, we have proposed a new
joint l2,1-norm based sparse model for action unit selection
in a discriminative fashion. Extensive experiments have been
carried out to validate our claims and have confirmed our
intuition that the action unit based representation is critical
for modeling complex activities from videos.
REFERENCES
[1] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine
recognition of human activities: A survey,” IEEE Trans. Circuits Syst.
Video Technol., vol. 18, no. 11, pp. 1473–1488, Sep. 2008.
[2] J. C. Niebles, H. Wang, and L. Fei-Fei, “Unsupervised learning of human
action categories using spatial-temporal words,” Int. J. Comput.
Vis., vol. 79, no. 3, pp. 299–318, Sep. 2008.
[3] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos
‘in the wild’,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
Jun. 2009, pp. 1996–2003.
[4] J. Niebles, C. Chen, and L. Fei-Fei, “Modeling temporal structure of
decomposable motion segments for activity classification,” in Proc. Eur.
Conf. Comput. Vis., 2010, pp. 392–405.
[5] H. Wang, C. Yuan, W. Hu, and C. Sun, “Supervised class-specific
dictionary learning for sparse modeling in action recognition,” Pattern
Recognit., vol. 45, no. 11, pp. 3902–3911, 2012.
[6] J. Liu, B. Kuipers, and S. Savarese, “Recognizing human actions by
attributes,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
Jun. 2011, pp. 3337–3344.
[7] J. Liu, S. Ali, and M. Shah, “Recognizing human actions using multiple
features,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
Jun. 2008, pp. 1–8.
[8] K. Rapantzikos, Y. Avrithis, and S. Kollias, “Dense saliency-based
spatiotemporal feature points for action recognition,” in Proc. IEEE Int.
Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1454–1461.
[9] W. Lee and H. Chen, “Histogram-based interest point detectors,” in
Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2009,
pp. 1590–1596.
[10] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action MACH a spatio-
temporal maximum average correlation height filter for action recogni-
tion,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2008,
pp. 1–8.
[11] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation
of local spatio-temporal features for action recognition,” in Proc. Brit.
Mach. Vis. Conf., 2009, pp. 1–11.
[12] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition
via sparse spatiotemporal features,” in Proc. 2nd Joint IEEE Int. Work-
shop Vis. Surveill. Perform. Eval. Track. Surveill., Oct. 2005, pp. 65–72.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning
realistic human actions from movies,” in Proc. IEEE Int. Conf. Comput.
Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[14] S. Ali and M. Shah, “Human action recognition in videos using
kinematic features and multiple instance learning,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 32, no. 2, pp. 288–303, Feb. 2010.
[15] A. F. Bobick and J. W. Davis, “The recognition of human movement
using temporal templates,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 23, no. 3, pp. 1257–1265, Mar. 2001.
[16] A. Yilmaz and M. Shah, “Actions sketch: A novel action representation,”
in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2005,
pp. 984–989.
[17] M. Blank, M. Irani, and R. Basri, “Actions as space-time shapes,” in
Proc. 10th IEEE ICCV, Oct. 2005, pp. 1395–1402.
[18] Z. Lin, Z. Jiang, and L. S. Davis, “Recognizing actions by shape-
motion prototype trees,” in Proc. IEEE 12th Int. Conf. Comput. Vis.,
Sep./Oct. 2009, pp. 444–451.
[19] F. Lv and R. Nevatia, “Single view human action recognition using key
pose matching and viterbi path searching,” in Proc. IEEE Int. Conf.
CVPR, Jun. 2007, pp. 1–8.
[20] H. Wang, C. Yuan, G. Luo, W. Hu, and C. Sun, “Action recognition
using linear dynamic systems,” Pattern Recognit., vol. 46, no. 6,
pp. 1710–1718, 2013.
[21] M. S. Ryoo and J. K. Aggarwal. (2010). An Overview of Contest on
Semantic Description of Human Activities (SDHA) [Online]. Available:
http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html
[22] M. Raptis and S. Soatto, “Tracklet descriptors for action modeling and
video analysis,” in Proc. ECCV, 2010, pp. 577–590.
[23] J. Sun, X. Wu, S. Yan, L. F. Cheong, T. S. Chua, and J. Li, “Hierarchical
spatio-temporal context modeling for action recognition,” in Proc. IEEE
Int. Conf. CVPR, Jun. 2009, pp. 2004–2011.
[24] M. Bregonzio, S. Gong, and T. Xiang, “Recognising action as clouds of
space-time interest points,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009,
pp. 1948–1955.
[25] S. F. Wong, T. K. Kim, and R. Cipolla, “Learning motion categories
using both semantic and structural information,” in Proc. IEEE Int. Conf.
CVPR, Jun. 2007, pp. 1–6.
[26] S. Savarese, A. Delpozo, J. Niebles, and L. Fei-Fei, “Spatial-temporal
correlations for unsupervised action classification,” in Proc. IEEE Work-
shop Motion Video Comput., Jan. 2008, pp. 1–8.
[27] I. Kotsia, S. Zafeiriou, and I. Pitas, “Texture and shape information
fusion for facial expression and facial action unit recognition,” Pattern
Recognit., vol. 41, no. 3, pp. 833–851, 2008.
[28] H. Jiang, M. Crew, and Z. Li, “Successive convex matching for action
detection,” in Proc. IEEE Int. Conf. CVPR, Jun. 2006, pp. 1646–1653.
[29] A. Klaser, M. Marszalek, and C. Schmid, “A spatio-temporal descriptor
based on 3D-gradients,” in Proc. Brit. Mach. Vis. Conf., 2008, pp. 1–10.
[30] P. Scovanner, S. Ali, and M. Shah, “A 3-dimensional SIFT descriptor
and its application to action recognition,” in Proc. ACM 15th Int. Conf.
Multimedia, 2007, pp. 357–360.
[31] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions:
A local SVM approach,” in Proc. IEEE 17th ICPR, vol. 3. Aug. 2004,
pp. 32–36.
[32] Z. Zhang, Y. Hu, S. Chan, and L. Chia, “Motion context: A new
representation for human action recognition,” in Proc. ECCV, 2008,
pp. 817–829.
[33] Y. Wang and G. Mori, “Human action recognition by semi-latent topic
models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 10,
pp. 1762–1774, Oct. 2009.
[34] J. Liu and M. Shah, “Learning human actions via information maxi-
mization,” in Proc. IEEE Int. Conf. CVPR, Jun. 2008, pp. 1–8.
[35] S. Ali, A. Basharat, and M. Shah, “Chaotic invariants for human action
recognition,” in Proc. IEEE 11th ICCV, Oct. 2007, pp. 1–8.
[36] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, “A biologically inspired
system for action recognition,” in Proc. IEEE 11th ICCV, Oct. 2007,
pp. 1–8.
[37] K. Schindler and L. Gool, “Action snippets: How many frames does
human action recognition require?” in Proc. IEEE Int. Conf. CVPR,
Jun. 2008, pp. 1–8.
[38] T. Berg, A. Berg, and J. Shih, “Automatic attribute discovery
and characterization from noisy web data,” in Proc. ECCV, 2010,
pp. 663–676.
[39] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects
by their attributes,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009,
pp. 1778–1785.
[40] S. E. Palmer, “Hierarchical structure in perceptual representation,”
Cognit. Psychol., vol. 9, no. 4, pp. 441–474, 1977.
[41] E. Wachsmuth, M. W. Oram, and D. I. Perrett, “Recognition of objects
and their component parts: Responses of single units in the temporal
cortex of the macaque,” Cereb. Cortex, vol. 4, no. 5, pp. 509–522, 1994.
[42] P. Paatero and U. Tapper, “Positive matrix factorization: A nonnegative
factor model with optimal utilization of error estimates of data values,”
Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[43] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative
matrix factorization,” Nature, vol. 401, pp. 788–791, Oct. 1999.
[44] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative
matrix factorization for data representation,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[45] Y. Wang and G. Mori, “Max-margin hidden conditional random fields for
human action recognition,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009,
pp. 872–879.
[46] D. Ramanan and D. Forsyth, “Automatic annotation of everyday move-
ments,” in Advances in Neural Information Processing Systems. Cam-
bridge, MA, USA: MIT Press, 2003.
[47] J. Liu, Y. Yang, and M. Shah, “Learning semantic visual vocabularies
using diffusion distance,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009,
pp. 461–468.
[48] A. Fathi and G. Mori, “Action recognition by learning mid-level motion
features,” in Proc. IEEE Int. Conf. CVPR, Jun. 2008, pp. 1–8.
[49] H. Wang, A. Klaser, C. Schmid, and C. Liu “Dense trajectories and
motion boundary descriptors for action recognition,” Int. J. Comput.
Vis., vol. 103, no. 1, pp. 60–79, 2013.
[50] C. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen
object classes by between-class attribute transfer,” in Proc. IEEE Int.
Conf. CVPR, Jun. 2009, pp. 951–958.
[51] J. Vogel and B. Schiele, “Semantic modeling of natural scenes for
content-based image retrieval,” Int. J. Comput. Vis., vol. 72, no. 2,
pp. 133–157, 2007.
[52] G. Wang, D. Hoiem, and D. Forsyth, “Learning image similarity from
Flickr groups using stochastic intersection kernel machines,” in Proc.
IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 428–435.
[53] L. J. Li, C. Wang, Y. Lim, D. Blei, and F. Li, “Building and using
a semantivisual image hierarchy,” in Proc. IEEE Int. Conf. CVPR,
Jun. 2010, pp. 3336–3343.
[54] L. Torresani, M. Szummer, and A. Fitzgibbon, “Efficient object category
recognition using classemes,” in Proc. Eur. Conf. Comput. Vis., 2010,
pp. 776–789.
[55] A. Gilbert, J. Illingworth, and R. Bowden, “Fast realistic multi-action
recognition using mined dense spatio-temporal features,” in Proc. IEEE
12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 925–931.
[56] A. Kovashka and K. Grauman, “Learning a hierarchy of discriminative
space-time neighborhood features for human action recognition,” in
Proc. IEEE Int. Conf. CVPR, Jun. 2010, pp. 2046–2053.
[57] C. Yuan, X. Li, W. Hu, H. Ling, and S. Maybank, “3D R transform
on spatio-temporal interest points for action recognition,” in Proc. IEEE
Conf. CVPR, Jun. 2013, pp. 724–730.
[58] C. Ding, D. Zhou, X. He, and H. Zha, “R1-PCA: Rotational invariant
L1-norm principal component analysis for robust subspace factoriza-
tion,” in Proc. 23rd ICML, 2006, pp. 281–288.
[59] A. Argyriou, T. Evgeniou, and M. Pontil, “Multi-task feature learning,”
in Advances in Neural Information Processing Systems. Cambridge, MA,
USA: MIT Press, 2007.
[60] G. Obozinski, B. Taskar, and M. Jordan, “Multi-task feature selection,”
Dept. Statist., Univ. California, Berkeley, CA, USA, Tech. Rep., 2006.
[61] H. Huang and C. Ding, “Robust tensor factorization using R1 norm,” in
Proc. IEEE Int. Conf. CVPR, Jun. 2008, pp. 1–8.
[62] F. R. K. Chung, Spectral Graph Theory. Providence, RI, USA: AMS,
1997.
[63] W. Brendel and S. Todorovic, “Activities as time series of human
postures,” in Proc. ECCV, 2010, pp. 721–734.
[64] B. Li, M. Ayazoglu, T. Mao, O. Camps, and M. Sznaier, “Activity
recognition using dynamic subspace angles,” in Proc. IEEE Int. Conf.
CVPR, Jun. 2011, pp. 3193–3200.
[65] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, “Learning hierarchical
invariant spatio-temporal features for action recognition with indepen-
dent subspace analysis,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern
Recognit., Jun. 2011, pp. 3361–3368.
[66] L. Yeffet and L. Wolf, “Local trinary patterns for human action recog-
nition,” in Proc. IEEE 12th ICCV, Sep./Oct. 2009, pp. 492–497.
[67] A. Yao, J. Gall, and L. Van Gool, “A Hough transform-based voting
framework for action recognition,” in Proc. IEEE Int. Conf. CVPR,
Jun. 2010, pp. 2061–2068.
[68] Q. Qiu, Z. Jiang, and R. Chellappa, “Sparse dictionary-based repre-
sentation and recognition of action attributes,” in Proc. IEEE ICCV,
Nov. 2011, pp. 707–714.
[69] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised
dictionary learning,” in Advances in Neural Information Processing
Systems. Cambridge, MA, USA: MIT Press, 2008.
[70] D. M. Bradley and J. A. Bagnell, “Differential sparse coding,” in
Advances in Neural Information Processing Systems. Cambridge, MA,
USA: MIT Press, 2008.
[71] J. Liu, Y. Yang, I. Saleemi, and M. Shah, “Learning semantic features for
action recognition via diffusion maps,” Comput. Vis. Image Understand.,
vol. 116, no. 3, pp. 361–377, 2012.
[72] N. Ikizler-Cinbis and S. Sclaroff, “Object, scene and actions: Combining
multiple features for human action recognition,” in Proc. ECCV, 2010,
pp. 494–507.
[73] M. Marszalek, I. Laptev, and C. Schmid, “Actions in context,” in Proc.
IEEE Int. Conf. CVPR, Jun. 2009, pp. 2929–2936.
[74] S. Sadanand and J. Corso, “Action bank: A high-level representation
of activity in video,” in Proc. IEEE Int. Conf. CVPR, Jun. 2012,
pp. 1234–1241.
[75] M. Raptis and L. Sigal, “Poselet key-framing: A model for human
activity recognition,” in Proc. IEEE Int. Conf. CVPR, Jun. 2013,
pp. 2650–2657.
[76] T. Lan, Y. Wang, and G. Mori, “Discriminative figure-centric models
for joint action localization and recognition,” in Proc. IEEE ICCV,
Nov. 2011, pp. 2003–2010.
[77] Y. Tian, R. Sukthankar, and M. Shah, “Spatiotemporal deformable part
models for action detection,” in Proc. IEEE Int. Conf. CVPR, Jun. 2013,
pp. 1–8.
[78] M. Ryoo, “Human activity prediction: Early recognition of ongoing
activities from streaming videos,” in Proc. IEEE ICCV, Nov. 2011,
pp. 1036–1043.
[79] Y. Zhang, X. Liu, M. Chang, X. Ge, and T. Chen, “Spatio-temporal
phrases for activity recognition,” in Proc. ECCV, 2012, pp. 707–721.
[80] A. Vahdat, B. Gao, M. Ranjbar, and G. Mori, “A discriminative key
pose sequence model for recognizing human interactions,” in Proc. IEEE
ICCV Workshops, Nov. 2011, pp. 1729–1736.
[81] A. Gaidon, Z. Harchaoui, and C. Schmid, “Temporal localization of
actions with actoms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35,
no. 11, pp. 2782–2795, Nov. 2013.
[82] M. Ullah, S. Parizi, and I. Laptev, “Improving bag-of-features action
recognition with non-local cues,” in Proc. Brit. Mach. Vis. Conf., 2010,
pp. 1–11.
[83] M. Raptis, I. Kokkinos, and S. Soatto, “Discovering discriminative action
parts from mid-level video representations,” in Proc. IEEE Int. Conf.
CVPR, Jun. 2012, pp. 1242–1249.
Haoran Wang received the B.S. degree from the
Department of Information Science and Technology,
Northeast University, Shenyang, China, in 2008.
He is a Ph.D. student in School of Automation,
Southeast University, Nanjing, China. His research
interests include action recognition, motion analysis,
and event detection.
Chunfeng Yuan received the B.S. and M.S. degrees
in computer science from the Qingdao University
of Science and Technology, China, in 2004 and
2007, respectively, and the Ph.D. degree in computer
science from the Institute of Automation (CASIA),
Chinese Academy of Sciences, Beijing, China, in
2010. She was an Assistant Professor at CASIA.
Her research interests and publications range from
statistics to computer vision, including sparse rep-
resentation, motion analysis, action recognition, and
event detection.
Weiming Hu received the Ph.D. degree from the
Department of Computer Science and Engineering,
Zhejiang University, in 1998. From 1998 to 2000, he
was a Post-Doctoral Research Fellow with the Insti-
tute of Computer Science and Technology, Peking
University. Currently, he is a Professor in the Insti-
tute of Automation, Chinese Academy of Sciences.
His research interests include visual surveillance and
filtering of Internet objectionable information.
Haibin Ling received the B.S. degree in mathe-
matics and the M.S. degree in computer science
from Peking University, China, in 1997 and 2000,
respectively, and the Ph.D. degree from the Univer-
sity of Maryland, College Park, in computer science
in 2006. From 2000 to 2001, he was an Assistant
Researcher at Microsoft Research Asia. From 2006
to 2007, he worked as a Post-Doctoral Scientist at
the University of California Los Angeles. He joined
Siemens Corporate Research as a Research Scientist.
Since 2008, he has been an Assistant Professor at
Temple University. His research interests include computer vision, medical
image analysis, human computer interaction, and machine learning. He
received the Best Student Paper Award at the ACM Symposium on User
Interface Software and Technology in 2003. He is currently an Area Chair
for CVPR 2014.
Wankou Yang received the B.S., M.S., and Ph.D.
degrees from the School of Computer Science and
Technology, Nanjing University of Science and
Technology, China, in 2002, 2004, and 2009, respec-
tively. He is currently an Assistant Professor with the
School of Automation, Southeast University. His
research interests include pattern recognition,
computer vision, and machine learning.
Changyin Sun is a Professor in the School
of Automation, Southeast University, China. He
received the M.S. and Ph.D. degrees in electri-
cal engineering from Southeast University, Nan-
jing, China, in 2001 and 2003, respectively. His
research interests include intelligent control, neural
networks, SVM, pattern recognition, and optimal
theory. He received the First Prize of Natural Science
from the Ministry of Education, China. He is an Associate
Editor of the IEEE TRANSACTIONS ON NEURAL
NETWORKS, Neural Processing Letters, the
International Journal of Swarm Intelligence Research, and Recent Patents on
Computer Science.

Fig. 1 (caption, continued). Some correlated interest points together can construct an action unit which is more descriptive and discriminative. A video sequence can be represented by a few action units, and each action class has its own representative action units.

Action models have traditionally been constructed from patterns of low-level features such as appearance patterns [27], [28], optical flow [14], [15], space-time templates [16], [17], 2D shape matching [18], [19], trajectory-based representations [22], [23], and bag-of-visual-words (BoVW) [13], [34]. However, these features can hardly characterize the rich semantic structure in actions.

Inspired by recent developments in object classification [38], [39], we introduce a high-level concept named "action unit" to describe human actions, as illustrated in Fig. 1. For example, the "golf-swinging" action contains some representative motions, such as "arm swing" and "torso twist", which are hardly described by the low-level features mentioned above. On the other hand, some correlated space-time interest points, when combined together, can characterize a representative motion. Moreover, the key frame is important for describing an action, and a key frame may be characterized by the co-occurrence of the space-time interest points extracted from that frame. The representative motions and key frames both reflect action units, which can then be used to represent action classes. With the above observation, we propose using high-level action units for human action representation.
Fig. 2. Flowchart of the proposed framework.

Typically, from an input human action video, hundreds of interest points are first extracted and then agglomerated into tens of action units, which compactly represent the video. Such a representation is more discriminative than the traditional BoVW model. To use it for action recognition, we address the following three main issues.

1. Selecting low-level features for generating the action units. Some of the aforementioned features require reliable tracking or body pose estimation, which is hard to achieve in practice. The interest-point-based representation circumvents such requirements while being robust to noise, occlusion, and geometric variation. But traditional bag-of-visual-words (BoVW) models utilize only the features of individual interest points and ignore spatial-temporal context information. To address this issue, we propose a new context-aware descriptor that incorporates context information from neighboring interest points. This way, the new descriptor is more discriminative and robust than the traditional BoVW.

2. Building an action unit set to represent all action classes under investigation. Nonnegative Matrix Factorization (NMF) [43] has received considerable attention and has been shown to capture part-based representation in the human brain [40], [41] as well as in vision tasks [42], [43]. We propose using graph regularized Nonnegative Matrix Factorization (GNMF) [44] to encode the geometrical information by constructing a nearest neighbor graph. It finds a part-based representation in which two data points are connected if they are sufficiently close to each other. The GNMF-based action units are automatically learned from the training data and are capable of capturing the intra-class variation of each action class.

3. Choosing discriminative action units and suppressing noise in action classes. We propose a new action unit selection method based on a joint l2,1-norm minimization. We first introduce the l2,1-norm for vectors. A sparse model based on this norm is robust to outliers, and the regularization guides the selection of action units across intra-class samples. The dictionary learning process captures the fact that actions from the same class share similar action units.

In this work we target learning high-level action units to represent and classify human actions. For this purpose, we improve over the traditional interest point feature and propose an action-unit-based solution, which is further improved by an action unit selection procedure. Fig. 2 illustrates the flowchart of our framework. In summary, the training phase learns the model for action units and the action classifier on the action-unit-based representation. The testing phase uses the learned model for action prediction.
In the rest of the paper, Sec. 2 reviews the related work. Sec. 3 introduces the new context-aware descriptor as the low-level feature. Sec. 4 presents the GNMF-based action unit as the high-level feature. Sec. 5 proposes the joint l2,1-norm based action unit selection and a supervised dictionary learning method for classification. Sec. 6 demonstrates the experimental results. Sec. 7 concludes this paper.

II. RELATED WORK

Action recognition has been widely explored in the computer vision community. Recently, some attempts have been made to use mid- or high-level concepts for human action recognition. Liu et al. [34] exploit mutual information maximization techniques to learn a compact mid-level codebook and use the spatial-temporal pyramid to exploit temporal information. Fathi et al. [48] extract discriminative flow features within overlapping space-time cells and select mid-level features via AdaBoost. Unfortunately, the global binning makes the representation sensitive to position or time shifts in the clip segmentation, and using predetermined fixed-size spatial-temporal grid bins assumes that the proper volume scale is known and uniform across action classes. Such uniformity is not inherent in the features themselves, given the large differences between the spatial-temporal distributions of the features for different activities. Wang et al. [45] use hidden conditional random fields for action recognition. The authors model an action class as a root template and a constellation of hidden "parts", where a hidden "part" is a group of local patches that are implicitly correlated with some intermediate representation. Ramanan et al. [46] first track the persons with 3D pose estimation, which is then used for action recognition. Liu et al. [47] use diffusion maps to automatically learn a semantic visual vocabulary from abundant quantized mid-level features, each represented by a vector of pointwise mutual information. But the vocabularies are created for individual categories, so they are not universal and general enough, which limits their applications. Liu et al. [6] learn data-driven attributes as latent variables. The authors augment the manually specified attributes with the automatically learned attributes to provide a complete characterization of human actions. Compared with traditional low-level features [13], [55]-[57], human actions are clearly more effectively represented by considering multiple high-level semantic concepts. However, current learning-based visual representations obtain labels from the entire video, and hence include background clutter which may degrade learning effectiveness.

High-level concepts have also been applied to object recognition. Farhadi et al. [39] use a set of semantic attributes such as 'hairy' and 'four-legged' to identify familiar objects, and to describe unfamiliar objects when images and bounding box annotations are provided. Lampert et al. [50] show that high-level descriptors in terms of semantic attributes can be used
to recognize object classes without any training image, once semantic attribute classifiers are trained from other classes of data. Vogel et al. [51] use attributes related to scenes to characterize image regions and combine these local semantic attributes to form a global image description for natural scene retrieval. Wang et al. [52] propose to represent an image by its similarities to Flickr image groups which have explicit semantic meanings. Classifiers are learned to predict the membership of images to Flickr groups, and the class membership probabilities are used to define the image similarity. Li et al. [53] build a semantically meaningful image hierarchy by using both visual and semantic information, and represent images by the estimated distributions of concepts over the entire hierarchy. Torresani et al. [54] use the outputs of a large number of object category classifiers to represent images.

Dictionary learning has been proven effective for selecting discriminative low-level features for classification. Liu et al. [3] use PageRank to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Brendel et al. [63] store multiple diverse exemplars per activity class, and learn a sparse dictionary of the most discriminative exemplars. Qiu et al. [68] propose a Gaussian process model to optimize the dictionary objective function. Their dictionary learning algorithm is based on sparse representation, which has recently received a lot of attention. Mairal et al. [69] generalize the reconstructive sparse dictionary learning process by optimizing the sparse reconstruction jointly with a linear prediction model. Bradley and Bagnell [70] propose a novel differentiable sparse prior rather than the conventional L1 norm, and employ a back-propagation procedure to train the dictionary for sparse coding in order to minimize the training error. These approaches need to explicitly associate each sample with a label in order to perform the supervised training. They aim at learning discriminative sparse models instead of purely reconstructive ones.

III. LOCALLY WEIGHTED WORD CONTEXT DESCRIPTOR

We propose a new context-aware descriptor called locally weighted word context (LWWC) as the low-level descriptor. LWWC encodes spatial context information rather than being limited to a single interest point as in traditional interest-point-based descriptors. Such spatial context information is extracted from neighboring interest points, and is used to improve the robustness and discriminability of the proposed descriptor.

We first perform space-time interest point detection using the methods in [12] and [13] for the different data sets. These interest points are initially described by histograms of optical flow (HOF) and histograms of oriented gradients (HOG), which respectively characterize the motion and appearance within a volume surrounding an interest point. Afterwards, we employ the k-means algorithm on these features to create a vocabulary of size K. Following the BoVW model, each interest point is then converted to a visual word.

Fig. 3. Illustration of the locally weighted word context descriptor. (a) shows the structure of the descriptor: LWWC is constructed from several neighboring interest points together, where D_\sigma(p, q_j) is the distance between the central interest point and a neighboring interest point. (b) shows the representation of the central interest point by a K-by-1 vector applied to each local feature.

Finally, for each interest point together with its N - 1 nearest interest points, the locally weighted word context (LWWC) descriptor is calculated as follows. Let N(p) = {p, q_1, ..., q_{N-1}} denote an interest point p and its N - 1 nearest neighboring points. The N - 1 nearest neighboring points are collected according to the normalized Euclidean distance on their 3D spatial-temporal coordinates:

    D_\sigma(p, q) = \Big( \sum_{i=1}^{3} \frac{1}{\sigma_i} \big(p(i) - q(i)\big)^2 \Big)^{1/2},    (1)

where the three components of p and q record the horizontal, vertical, and temporal positions of the interest points, respectively, and \sigma_i is the corresponding weight. Using the BoVW model, each point in N(p) is assigned to a visual word. Therefore, the LWWC of N(p) is defined as a vector of size K x 1, where K is the size of the vocabulary. The k-th element of the LWWC vector is inversely proportional to the distance between p and the corresponding points in N(p). If multiple points in N(p) belong to the same visual word, their responses are summed together. Specifically, we set the value corresponding to p to 1, and the value contributed by q_j in N(p) to \beta / D_\sigma(p, q_j). As shown in Fig. 3, the LWWC for N(p) is denoted as follows:

    F_p = [h_1, h_2, \cdots, h_K]^T,    (2)

    h_i = \begin{cases} 1 & \text{if } \mathrm{label}(p) = i \\ \sum_{j=1}^{N-1} \dfrac{\beta \cdot \delta(\mathrm{label}(q_j) - i)}{D_\sigma(p, q_j)} & \text{otherwise,} \end{cases}    (3)

where label(p) denotes the word id of point p. By rebuilding the BoVW model on the LWWC descriptors, each action video is represented by a vector based on low-level features.
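As a concrete illustration of Eqs. (1)-(3), the following minimal sketch computes one LWWC vector with NumPy. It is a reading of the text rather than the authors' implementation: the array layout, the exclusion of the centre point from its own neighbor list, and the guard against zero distances are our assumptions.

```python
import numpy as np

def lwwc_descriptor(points, labels, center_idx, K,
                    sigma=(1.0, 1.0, 1.0), n_neighbors=8, beta=1.0):
    """Locally weighted word context vector for one interest point (Eqs. (1)-(3)).

    points : (M, 3) array of (x, y, t) interest-point coordinates
    labels : (M,) array of visual-word ids in [0, K)
    Returns a (K,) vector F_p.
    """
    p = points[center_idx]
    # Normalized Euclidean distance of Eq. (1)
    diffs = points - p
    dists = np.sqrt(((diffs ** 2) / np.asarray(sigma)).sum(axis=1))

    # Collect the N-1 nearest neighbours, excluding the centre itself
    order = [i for i in np.argsort(dists) if i != center_idx]
    neighbors = order[:n_neighbors]

    # Eq. (3): neighbours vote for their visual word, weighted by inverse distance;
    # votes for the same word are summed.
    h = np.zeros(K)
    for j in neighbors:
        h[labels[j]] += beta / max(dists[j], 1e-12)
    # The centre's own word is set to 1.
    h[labels[center_idx]] = 1.0
    return h
```

Stacking such vectors for all interest points of a video and re-quantizing them, as described above, would then give the low-level video representation used in the rest of the pipeline.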
IV. GNMF-BASED ACTION UNITS

NMF is a factorization algorithm for analyzing nonnegative matrices. The nonnegative constraints allow only additive combinations of different bases. This is the most significant difference between NMF and other matrix factorization methods such as singular value decomposition (SVD). NMF can learn a part-based representation, but it fails to discover the intrinsic geometrical and discriminative structure of the data space, which is essential for real-world applications. If two data points are close in the intrinsic geometry of the data distribution, their representations with respect to the new bases should still be close to each other. GNMF [44] aims to solve this problem.

Most previous works represent actions with low-level features. In this paper, we propose to extract high-level action units based on GNMF to better describe human actions. The GNMF-based action units are generated as follows. Let y_i^j in R^d, i = 1, ..., C, j = 1, ..., n_i denote the d-dimensional low-level feature representation of the j-th video in class i. All such representations in class i form a matrix Y_i = [y_i^1, ..., y_i^{n_i}] in R^{d x n_i}. GNMF minimizes the following objective function:

    Q = \| Y_i - U V^T \|_F^2 + \lambda \, \mathrm{Trace}(V^T L V),    (4)

where U in R^{d x s_i} and V in R^{n_i x s_i} are two nonnegative matrices, L = D - W is the graph Laplacian [62], and W is a symmetric nonnegative similarity matrix. We adopt the heat kernel weight W_{jl} = e^{-\frac{1}{\delta} \| y_i^j - y_i^l \|^2}. D is a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W. Each column of the matrix V^T is a low-dimensional representation of the corresponding column of Y_i with respect to the new bases. We define the column vectors of the matrix U as the action units belonging to action class i. Each element of such a column vector corresponds to a visual word obtained from the low-level descriptors, so each column vector of U is a semantic representation constructed from several correlated visual words. Repeating the same process for each action class, we obtain all the class-specific action units.

As an example, suppose we have four action units forming the bases: "translation of torso", "up-down torso motion", "arm motion", and "leg motion". Then the action class "walking" may be represented by an action-unit-based vector [1, 0, 1, 1] in R^4, and "waving" may be represented by the vector [0, 0, 2, 0] in R^4, with each dimension indicating the degree of the corresponding action unit, as shown in Fig. 4.

Fig. 4. GNMF-based action units. (a) A video from the class "walking" includes two action units, "translation of torso" and "leg motion". (b) A video from the class "waving" includes two identical action units, "arm motion".

The action-unit-based representation has two main advantages. First, it is compact, since only tens of action units are needed to describe an action video. This is more efficient than BoVW models, where hundreds of interest points are needed. Second, some low-level local features are not discriminative and may even have a negative influence on classification. The process of learning class-specific action units can suppress such noise: the matrix factorization extracts the representative action units for each action class, and representative action units should exist in all the videos belonging to the same action class. Low-level local features that only exist in a few intra-class videos are suppressed by the algorithm mentioned above, and are not used for constructing the high-level action units.
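For illustration, the sketch below minimizes the objective of Eq. (4) with the standard multiplicative update rules commonly used for GNMF. The graph construction details (a symmetric k-nearest-neighbour graph with heat-kernel weights, the number of neighbours, and the convergence criterion) are assumptions made for this sketch and are not prescribed by the text above.

```python
import numpy as np

def gnmf_action_units(Y, s, lam=1.0, delta=1.0, n_neighbors=5,
                      n_iter=200, eps=1e-9, seed=0):
    """Learn s action units for one class by minimizing Eq. (4).

    Y : (d, n) nonnegative matrix whose columns are the low-level video vectors.
    Returns U (d, s), whose columns are the action units, and V (n, s).
    """
    d, n = Y.shape

    # Heat-kernel similarity W on a k-nearest-neighbour graph, and degree matrix D
    sq = ((Y.T[:, None, :] - Y.T[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / delta)
    np.fill_diagonal(W, 0.0)
    keep = np.zeros_like(W)
    for j in range(n):
        keep[j, np.argsort(-W[j])[:n_neighbors]] = 1.0
    W *= np.maximum(keep, keep.T)          # symmetric k-NN graph
    D = np.diag(W.sum(axis=1))

    # Multiplicative updates for || Y - U V^T ||_F^2 + lam * Tr(V^T (D - W) V)
    rng = np.random.default_rng(seed)
    U = rng.random((d, s))
    V = rng.random((n, s))
    for _ in range(n_iter):
        U *= (Y @ V) / (U @ (V.T @ V) + eps)
        V *= (Y.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V
```

Applied class by class, the columns of the returned U would form the block of class-specific action units that is later concatenated into the dictionary of Sec. V.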
The learned class-specific action units can exhibit the characteristics of each action class, so the proposed action-unit-based representation is more powerful for classification.

V. ROBUST ACTION UNIT SELECTION BASED ON JOINT l2,1-NORMS

Recently, sparse representation has received a lot of attention in computer vision. Typically, it approximates the input signal by a sparse linear combination of given over-complete bases in a dictionary. Such sparse representations are usually derived by linear programming as an l1-norm minimization problem, but l1-norm based regularization is sensitive to outliers. Inspired by the l2,1-norm of a matrix [58], we first introduce the l2,1-norm of a vector. Moreover, we propose a new joint l2,1-norm based sparse model to select the representative action units for each action class. The proposed sparse model has two main advantages for classification-based action unit selection. First, the l2,1-norm of the matrix in our sparse model encourages samples from the same action class to be constructed from similar action units, and suppresses action units that appear in only a few intra-class samples. Second, each action class has its own representative action units, and the l2,1-norm of the vector in our sparse model encourages each sample to be constructed from the action units of the same class.

A. Notations and Definitions

We first introduce the notations and the definitions of the norms. For a matrix Z, its j-th row and k-th column are denoted by Z_{j.} and Z_{.k} respectively, and z_{jk} is the element in the j-th row and k-th column. The l2,1-norm of a matrix is introduced in [58] as a rotationally invariant l1-norm and is also used for multi-task learning [59], [60] and tensor factorization [61]. It is defined as:

    \| Z \|_{2,1} = \sum_{j=1}^{n} \| Z_{j\cdot} \|_2 = \sum_{j=1}^{n} \Big( \sum_{k=1}^{r} z_{jk}^2 \Big)^{\frac{1}{2}}.    (5)

We now introduce the l2,1-norm of a vector. For a vector b = [b_1, b_2, ..., b_n]^T, the elements are divided into G groups according to some rule, with m_g elements in group g, i.e., b = [b_{11}, ..., b_{1 m_1}, ..., b_{g1}, ..., b_{g m_g}, ..., b_{G1}, ..., b_{G m_G}]^T. The l2,1-norm of the vector b is defined as:

    \| b \|_{2,1} = \sum_{g=1}^{G} \Big( \sum_{k=1}^{m_g} b_{gk}^2 \Big)^{\frac{1}{2}}.    (6)
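The two norms in Eqs. (5) and (6) translate directly into a few lines of NumPy. The sketch below is a straightforward transcription; passing the vector grouping as an explicit list of group sizes is our choice of interface.

```python
import numpy as np

def l21_norm_matrix(Z):
    """Matrix l2,1-norm of Eq. (5): the sum of the l2-norms of the rows of Z."""
    return np.sqrt((np.asarray(Z, dtype=float) ** 2).sum(axis=1)).sum()

def l21_norm_vector(b, group_sizes):
    """Grouped l2,1-norm of a vector, Eq. (6).

    group_sizes : list [m_1, ..., m_G] with sum(group_sizes) == len(b);
    here a group corresponds to the action units of one class.
    """
    b = np.asarray(b, dtype=float)
    total, start = 0.0, 0
    for m in group_sizes:
        total += np.linalg.norm(b[start:start + m])
        start += m
    return total
```

Note that with a single group the vector norm of Eq. (6) reduces to the ordinary l2-norm, and with one element per group it reduces to the l1-norm, which is the sense in which it enforces sparsity among groups rather than among individual coefficients.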
B. Problem Formulation

Following the notation in Sec. IV, assume that the i-th action class has m_i learned action units, with \sum_{i=1}^{C} m_i = m. We initialize the dictionary B = [B_1, B_2, ..., B_C] such that B_i = [b_{i1}, ..., b_{i m_i}], where b_{ij} denotes the j-th action unit of the i-th class. The proposed sparse model for action unit selection is

    \min_{B, X^i} \sum_{i=1}^{C} \Big( \| Y_i - B X^i \|_F^2 + \gamma_1 \sum_k \| X^i_{\cdot k} \|_{2,1} + \gamma_2 \| X^i \|_{2,1} \Big),    (7)

where \| \cdot \|_F denotes the Frobenius norm and \| \cdot \|_{2,1} the l2,1-norm. The term \| X^i_{\cdot k} \|_{2,1} penalizes each group of elements of the vector X^i_{\cdot k} (the groups corresponding to the blocks B_i) as a whole and enforces sparsity among the groups, so it encourages each action video to be constructed from action units of the same class. The term \| X^i \|_{2,1} penalizes each row of the matrix X^i as a whole and enforces sparsity among the rows, so it encourages videos from the same action class to be constructed from similar action units. \gamma_1 and \gamma_2 are the regularization parameters.

For classification, with a sparse model \phi(y_t, B), a predictive model f(x_t) = f(\phi(y_t, B)), a class label l_t of the action video y_t, and a classification loss L(l_t, f(x_t)) = \| l_t - f(x_t) \|_2^2, we desire to train the whole system with respect to the dictionary B given P training samples:

    \min_B E = \min_B \sum_{t=1}^{P} L\big(l_t, f(\phi(y_t, B))\big).    (8)

The dictionary optimization is carried out using an iterative approach composed of two steps: the sparse coding step with B fixed and the dictionary update step with X^i fixed.

Step 1: Sparse coding. Taking the derivative of (7) with respect to X^i_{\cdot k} (1 \le k \le n_i) and setting it to zero, we have

    2 B^T B X^i_{\cdot k} - 2 B^T Y^i_{\cdot k} + \gamma_1 D_k X^i_{\cdot k} + \gamma_2 E X^i_{\cdot k} = 0,    (9)

where

    E = \mathrm{diag}\Big( \frac{1}{\| X^i_{1\cdot} \|_2}, \frac{1}{\| X^i_{2\cdot} \|_2}, \cdots, \frac{1}{\| X^i_{m\cdot} \|_2} \Big),    (10)

    D_k = \mathrm{diag}\big( w_1 I(m_1), w_2 I(m_2), \cdots, w_C I(m_C) \big),    (11)

where w_j = \Big( \sum_{p = 1 + M_j - m_j}^{M_j} (X^i_{p,k})^2 \Big)^{-1/2}, M_j = \sum_{l=1}^{j} m_l, I(m_j) is the m_j x m_j identity matrix, and \mathrm{diag}(\cdot) denotes a diagonal matrix formed from the elements of its argument. From Eq. (9), we get

    X^i_{\cdot k} = 2 \big( 2 B^T B + \gamma_1 D_k + \gamma_2 E \big)^{-1} B^T Y^i_{\cdot k}.    (12)

Note that D_k and E depend on X^i and are thus also unknown. An iterative algorithm is proposed to solve this problem, which is summarized in Algorithm 1 (an iterative algorithm for sparse coding).

Step 2: Dictionary updating. Minimizing the loss function E over B ties the learned dictionary to the classification model, and therefore improves the classification effectiveness. We compute the gradient of E with respect to B according to Eq. (8):

    \nabla_B E = \sum_{t=1}^{P} \nabla_B L = \sum_{t=1}^{P} \nabla_f L \cdot \nabla_B f = \sum_{t=1}^{P} \nabla_f L \cdot \nabla_{x_t} f \cdot \nabla_B x_t.    (13)

Therefore, the problem reduces to computing the gradient of the sparse representation x_t with respect to the dictionary B. According to Eq. (12), we have

    \nabla_B X^i_{\cdot k} = \big( 2 B^T B + \gamma_1 D_k + \gamma_2 E \big)^{-1} \Big[ 2 \nabla_B (B^T Y^i_{\cdot k}) - \big( 2 \nabla_B (B^T B) + \gamma_1 \nabla_B D_k + \gamma_2 \nabla_B E \big) X^i_{\cdot k} \Big].    (14)

Through the above process, we obtain the optimized dictionary. For a test sample y, we get the action-unit-based representation x by

    \min_x \| y - B x \|_F^2 + \gamma_1 \| x \|_{2,1}.    (15)

An SVM is adopted as the predictive model for discriminative dictionary learning and classification, and we employ the generalized Gaussian kernel with the \chi^2 distance,

    K(H_i, H_j) = \exp\Big( -\frac{1}{A} \chi^2(H_i, H_j) \Big),    (16)

for two histograms H_i, H_j, where the scale parameter A is set to the mean distance between training samples.
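Algorithm 1 is referenced only by its caption in this text, so the sketch below reconstructs the fixed-point iteration implied by Eqs. (9)-(12): alternately recompute E and D_k from the current coefficients and then apply the closed form of Eq. (12). The initialization, fixed iteration count, and the small eps guard against zero norms are our assumptions, as is the common definition of the chi-square distance used for the kernel of Eq. (16).

```python
import numpy as np

def sparse_code_class(Y_i, B, group_sizes, gamma1=0.2, gamma2=0.2,
                      n_iter=50, eps=1e-8):
    """Iterative sparse coding for one class (fixed point of Eqs. (9)-(12)).

    Y_i : (d, n_i) matrix of low-level video representations of class i.
    B   : (d, m) dictionary whose columns are action units, grouped by class.
    group_sizes : [m_1, ..., m_C], number of units contributed by each class.
    Returns X_i of shape (m, n_i).
    """
    m = B.shape[1]
    bounds = np.cumsum([0] + list(group_sizes))       # group boundaries M_j
    X = np.linalg.lstsq(B, Y_i, rcond=None)[0]        # simple initialization
    BtB, BtY = B.T @ B, B.T @ Y_i

    for _ in range(n_iter):
        # E of Eq. (10): one weight per row of X_i
        E = np.diag(1.0 / (np.linalg.norm(X, axis=1) + eps))
        X_new = np.empty_like(X)
        for k in range(Y_i.shape[1]):
            # D_k of Eq. (11): one weight per class group of the k-th column
            w = np.empty(m)
            for j in range(len(group_sizes)):
                g = slice(bounds[j], bounds[j + 1])
                w[g] = 1.0 / (np.linalg.norm(X[g, k]) + eps)
            # Closed form of Eq. (12) with E and D_k held fixed
            A = 2 * BtB + gamma1 * np.diag(w) + gamma2 * E
            X_new[:, k] = 2 * np.linalg.solve(A, BtY[:, k])
        X = X_new
    return X

def chi2_kernel(H1, H2, A_scale, eps=1e-12):
    """Kernel of Eq. (16), using the usual chi-square histogram distance."""
    chi2 = 0.5 * np.sum((H1 - H2) ** 2 / (H1 + H2 + eps))
    return np.exp(-chi2 / A_scale)
```

The per-column weights D_k make each linear solve column specific; the plain loop is kept here so that the correspondence with Eqs. (9)-(12) remains explicit, although caching factorizations or vectorizing over columns would be natural optimizations.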
VI. EXPERIMENTAL RESULTS

A. Data Sets

Five action data sets are used in our evaluation: the KTH action data set [31], the UCF Sports data set [10], the UT-Interaction data set [21], the UCF YouTube data set [3], and the Hollywood2 data set [73]. Examples of these data sets are shown in Fig. 5.

The KTH data set contains six single-person actions (boxing, hand waving, hand clapping, walking, jogging, and running) performed repeatedly by 25 persons in four different scenarios: outdoors, outdoors with camera zoom, outdoors with different clothes, and indoors.

The UCF Sports data set is a challenging collection of action clips from various broadcast sports videos. Actions include diving, golf swinging, kicking, lifting, horseback riding, running, skating, swinging, and walking. The actions are captured in a wide range of scenes and viewpoints.
Fig. 5. Representative frames from videos in the five data sets. From top to bottom: the KTH data set, the UCF Sports data set, the UT-Interaction data set, the UCF YouTube data set, and the Hollywood2 data set.

The UT-Interaction data set was used in the first Contest on Semantic Description of Human Activities [21]. It contains action sequences of six interactions: hug, kick, point, punch, push, and hand-shake. For classification, 120 video segments cropped based on the ground-truth bounding boxes and time intervals are provided by the data set organizers. These segments are further divided into two sets, each containing 60 segments with 10 segments per class. Set 1 is captured in a parking lot and set 2 on a lawn.

The UCF YouTube data set contains 11 action categories: basketball shooting, biking/cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog. This data set is challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, and illumination conditions.

The Hollywood2 data set was collected from 69 different Hollywood movies. There are 12 action classes: answering the phone, driving a car, eating, fighting, getting out of a car, hand shaking, hugging, kissing, running, sitting down, sitting up, and standing up. In our experiments, we use the clean training data set. In total, there are 1707 action samples divided into a training set and a test set, and the training and test sequences come from different movies.

For the KTH and UT-Interaction data sets, the Harris3D detector [13] is used for interest point extraction, and the cuboid detector [12] is adopted for the other data sets.

B. Effects of the LWWC Descriptor

In our approach, we adopt the LWWC descriptor at the low level. Different from traditional interest-point-based methods, the proposed descriptor captures the information of an area, rather than a single interest point, by describing the distribution of neighboring interest points. Experiments are conducted to evaluate the influence of the neighborhood information in the LWWC descriptor.

Fig. 6. Performance of LWWC on the KTH and UT-Interaction data sets.
Fig. 7. Performance of action unit selection on KTH and UT-Interaction.

Fig. 6 illustrates the recognition rates corresponding to different scales of neighborhood information covered by the proposed descriptor on the KTH and the challenging UT-Interaction data sets. Traditional interest-point-based methods only utilize the features of a single interest point, which describe a very small area, so the accuracy is easily influenced by noise; the corresponding recognition rate is 93.99% on the KTH data set. When the features of neighboring interest points are involved, the proposed descriptor describes a larger area and makes use of more neighborhood information, and the recognition rate rises to 95.49% when the 8 nearest neighboring interest points are collected for the LWWC descriptor. To further validate the effectiveness of our descriptor, we conduct similar experiments on the UT-Interaction data set. In set 1, when we adopt the feature of a single interest point, the accuracy is only 78.3%; the accuracy rises to 81.7% when the 4 nearest neighboring interest points are collected.
In set 2, the accuracy rises from 68.3% to 80.0% when the 8 nearest neighboring interest points are collected. These experimental results show that the recognition rate is improved when neighborhood information is incorporated into the LWWC descriptor.

C. Analysis of the Action Unit Selection

Based on the low-level descriptors, the action units are learned through GNMF. Among the learned action units, our proposed joint l2,1-norm based sparse model aims at selecting the class-specific representative action units to improve the recognition performance: it encourages actions from the same class to be described by the same action units, and each action to be described by action units from the same class.

To evaluate action unit selection, we compare the performance of the original GNMF-based action units with that obtained using action unit selection, as illustrated in Fig. 7. On the KTH data set, action unit selection significantly boosts the performance from 92.65% to 95.49%. On the UT-Interaction data set, it again significantly boosts the recognition accuracies from 80.0% to 81.7% (on Set 1) and from 70.0% to 80.0% (on Set 2). These results clearly validate that the proposed joint l2,1-norm based action unit selection method is effective in improving the recognition performance.
Fig. 8. Confusion matrix of the classification on the KTH data set.
TABLE I. Comparison with previous methods on the KTH data set.

D. Experiments on the KTH Data Set

Consistent with the experimental setting used in [13], [31], [34], we test the proposed approach on the entire KTH data set [31], in which videos of the four scenarios are mixed together. We split the data set into a training part with 24 persons' videos and a test part with the remaining videos, and the final result is the average over 25 runs. For the sparse-model-based action unit selection, we set the tradeoff parameters γ1 = γ2 = 0.2.

Fig. 8 presents the confusion matrix of the classification on the KTH data set. It shows that our approach works excellently on most actions, such as "hand waving" and "boxing". The main confusion occurs between "jogging" and "running", since these actions as performed by some actors are very similar. Table I lists the average accuracies of our method and other recently proposed ones. It shows that our method achieves excellent performance (95.49%), which is comparable to the best reported results. The experimental results validate the effectiveness of the proposed method.

TABLE II. Contributions of the proposed different approaches to classification on the KTH data set.
TABLE III. Classification accuracies on the KTH data set.

Furthermore, we compare the performance of different baseline approaches (the traditional single interest point feature, the proposed LWWC descriptor, the l1-norm based sparse model, and our action unit selection approach) and study the contribution of each part of our method, as illustrated in Table II. The accuracy of the traditional single interest point feature is 91.80%. If we only use the LWWC descriptor, the accuracy is 92.65%. When we only adopt action unit selection based on the traditional single interest point feature, the accuracy is 93.99%. Combining both, the accuracy reaches 95.49%. When we use the l1-norm based sparse model on the LWWC descriptors, the accuracy is 92.82%. The study demonstrates that each of the proposed approaches offers more discriminative power than the BoVW baseline, and that the l2,1-norm based action unit selection approach obtains better performance than the l1-norm based sparse model. It further validates the effectiveness of the high-level descriptor for classification. Our method, which combines the low-level LWWC descriptor with the high-level action unit selection, achieves the best performance. Furthermore, our method is compared with some basic cases, as shown in Table III; the comparisons validate that the proposed method improves on the performance of traditional methods.

E. Experiments on the UCF Sports Data Set

Most previously reported results on this data set use the leave-one-out (LOO) protocol, cycling each example as the test video one at a time. But Lan et al. [76] propose to split the data set into disjoint training and testing sets to avoid background regularity in the evaluation. We report our results under both protocols. For this realistic data set, we perform dense and multi-scale interest point extraction. To generate the codebook, we empirically set the codebook size k to 1000. For the sparse-model-based action unit selection, we set the tradeoff parameters γ1 = γ2 = 0.2.

Fig. 9 presents the confusion matrix across all scenarios in the leave-one-out protocol on the UCF Sports data set. Our method works well on most actions.
For example, the recognition accuracies for some actions, such as "diving" and "lifting", reach 100%. This data set has complex backgrounds, and some actions, such as "golfing", "horseback riding", and "running", are very similar and challenging to recognize. We conduct further experiments on
the UCF Sports data set to study the different components of the proposed approach, similar to the study on the KTH data set. Table IV illustrates the comparison of the two proposed approaches and their combination under Lan's protocol (Split). The effectiveness of each proposed ingredient is again confirmed, consistent with the results on the KTH data set. Table V compares the overall mean accuracy of our method with the results reported by previous researchers. Our average recognition accuracy is competitive with most reported results except the action bank method [74].

Fig. 9. Confusion matrix on the UCF Sports data set (LOO).
TABLE IV. Contributions of the proposed different approaches to classification on the UCF Sports data set (Split).
TABLE V. Comparison with previous methods on the UCF Sports data set.

F. Experiments on the UT-Interaction Data Set

The action videos in UT-Interaction are divided into two sets. To generate the codebook, we empirically set the codebook size k to 500 in set 1 and to 300 in set 2. We set the tradeoff parameters γ1 = γ2 = 0.2 for both sets, and we use the leave-one-out test strategy.

Fig. 10. Confusion matrices on the UT-Interaction data set. (a) Set 1. (b) Set 2.
TABLE VI. Contributions of the proposed different approaches to classification on the UT-Interaction data set.
TABLE VII. Comparison with previous methods on the UT-Interaction data set. The third and fourth columns report the recognition rate using the first half and the entire videos, respectively.

Fig. 10 presents the confusion matrices across all scenarios in set 1 and set 2. The recognition accuracies for some actions, such as "point" in set 2 and "push", are excellent. Table VI illustrates the comparison of the two proposed approaches and their combination in both set 1 and set 2. Each approach offers more discriminative power than the traditional single interest point feature. The LWWC descriptor performs better than the action unit selection method in both sets, and their combination provides the best performance. The l2,1-norm based action unit selection method outperforms the l1-norm based sparse model. Also, the results of our method are compared with those reported by previous researchers in Table VII, and with some basic cases in Table VIII. The proposed method
G. Experiments on the UCF YouTube Data Set

We follow the original setup [3], using leave-one-out cross-validation over a pre-defined set of 25 folds, and perform dense interest point extraction. The codebook size is set to 1000, and the tradeoff parameters to γ1 = γ2 = 0.3. Fig. 11 presents the confusion matrix across all scenarios on the UCF YouTube data set. In Fig. 12, we compare the per-class performance of related methods, including the cuboid feature, LWWC, action unit selection, and some previously reported interest-point-based methods; our method outperforms the others.

Fig. 11. Confusion matrix on the UCF YouTube data set.
Fig. 12. Performance comparisons among related interest-point-based approaches on the UCF YouTube data set.

Table IX compares the overall mean accuracy of our method with the results reported by previous researchers. Our average recognition accuracy is 82.2%, which is comparable to the state-of-the-art performance and outperforms other interest-point-based methods.

TABLE IX. Accuracies on the UCF YouTube data set.

H. Experiments on the Hollywood2 Data Set

Similar to the parameters used on the YouTube data set, the codebook size k is empirically set to 1000, and we set the tradeoff parameters γ1 = γ2 = 0.3. Fig. 13 presents the performance of our method and the contribution of each of its components to the recognition accuracy. The accuracy of the traditional single interest point feature is 47.9%. If we only utilize the LWWC descriptor, the accuracy is 50.1%. When we only adopt the action unit selection based on the traditional single interest point feature, the accuracy is 54.5%. In combination, the accuracy of our method is 56.8%. Table X compares the overall mean accuracy of our method with the results reported by previous researchers. Our average recognition accuracy is better than or comparable to the state-of-the-art performances.

Fig. 13. Performance of different versions of the proposed approach on the Hollywood2 data set.
TABLE X. Comparison with previous methods on the Hollywood2 data set.
VII. CONCLUSION

In this paper, we have proposed to represent human actions by a set of intermediate concepts called action units, which are automatically learned from the training data. At the low level, we have presented a locally weighted word context descriptor to improve the traditional interest-point-based representation. The proposed descriptor incorporates neighborhood information effectively. At the high level, we have introduced GNMF-based action units to bridge the semantic gap in action representation. Furthermore, we have proposed a new joint l2,1-norm based sparse model for action unit selection in a discriminative fashion. Extensive experiments have been carried out to validate our claims and have confirmed our intuition that the action unit based representation is critical for modeling complex activities from videos.
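For readers who want to reproduce the action unit learning step at a high level, the following is a minimal sketch of the standard graph-regularized NMF multiplicative updates of Cai et al. [44] (our simplification, not the authors' implementation); the affinity matrix, dimensions, and parameters below are placeholders.

```python
import numpy as np

def gnmf(X, k, W, lam=1.0, n_iter=200, eps=1e-9, seed=0):
    """Graph-regularized NMF (multiplicative updates, Cai et al. [44]).

    Factorizes the nonnegative data matrix X (m x n, one column per sample)
    as X ~ U @ V.T with nonnegative U (m x k) and V (n x k), while the term
    lam * tr(V.T @ L @ V), with Laplacian L = D - W built from the affinity
    matrix W (n x n), keeps neighboring samples close in the new space.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))
    V = rng.random((n, k))
    D = np.diag(W.sum(axis=1))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ V.T @ V + eps)
        V *= (X.T @ U + lam * W @ V) / (V @ U.T @ U + lam * D @ V + eps)
    return U, V

# Toy usage: 200 word histograms of dimension 500 factorized into 20 units.
rng = np.random.default_rng(1)
X = rng.random((500, 200))
A = rng.random((200, 200))
W = ((A + A.T) > 1.6).astype(float)   # symmetric toy affinity graph
np.fill_diagonal(W, 0.0)
U, V = gnmf(X, k=20, W=W, lam=0.5)
print(U.shape, V.shape)               # (500, 20) basis, (200, 20) encodings
```

Under this sketch, the columns of U play the role of nonnegative, part-based action units, and the rows of V give the per-sample unit activations that feed the subsequent l2,1-regularized selection.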
REFERENCES

[1] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine recognition of human activities: A survey,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 11, pp. 1473–1488, Sep. 2008.
[2] J. C. Niebles, H. Wang, and L. Fei-Fei, “Unsupervised learning of human action categories using spatial-temporal words,” Int. J. Comput. Vis., vol. 79, no. 3, pp. 299–318, Sep. 2008.
[3] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos ‘in the wild’,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1996–2003.
[4] J. Niebles, C. Chen, and L. Fei-Fei, “Modeling temporal structure of decomposable motion segments for activity classification,” in Proc. Eur. Conf. Comput. Vis., 2010, pp. 392–405.
[5] H. Wang, C. Yuan, W. Hu, and C. Sun, “Supervised class-specific dictionary learning for sparse modeling in action recognition,” Pattern Recognit., vol. 45, no. 11, pp. 3902–3911, 2012.
[6] J. Liu, B. Kuipers, and S. Savarese, “Recognizing human actions by attributes,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 3337–3344.
[7] J. Liu, S. Ali, and M. Shah, “Recognizing human actions using multiple features,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[8] K. Rapantzikos, Y. Avrithis, and S. Kollias, “Dense saliency-based spatiotemporal feature points for action recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1454–1461.
[9] W. Lee and H. Chen, “Histogram-based interest point detectors,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1590–1596.
[10] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action MACH: A spatio-temporal maximum average correlation height filter for action recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[11] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation of local spatio-temporal features for action recognition,” in Proc. Brit. Mach. Vis. Conf., 2009, pp. 1–11.
[12] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatiotemporal features,” in Proc. 2nd Joint IEEE Int. Workshop Vis. Surveill. Perform. Eval. Track. Surveill., Oct. 2005, pp. 65–72.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[14] S. Ali and M. Shah, “Human action recognition in videos using kinematic features and multiple instance learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 2, pp. 288–303, Feb. 2010.
[15] A. F. Bobick and J. W. Davis, “The recognition of human movement using temporal templates,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 3, pp. 1257–1265, Mar. 2001.
[16] A. Yilmaz and M. Shah, “Actions sketch: A novel action representation,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 984–989.
[17] M. Blank, M. Irani, and R. Basri, “Actions as space-time shapes,” in Proc. 10th IEEE ICCV, Oct. 2005, pp. 1395–1402.
[18] Z. Lin, Z. Jiang, and L. S. Davis, “Recognizing actions by shape-motion prototype trees,” in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 444–451.
[19] F. Lv and R. Nevatia, “Single view human action recognition using key pose matching and Viterbi path searching,” in Proc. IEEE Int. Conf. CVPR, Jun. 2007, pp. 1–8.
[20] H. Wang, C. Yuan, G. Luo, W. Hu, and C. Sun, “Action recognition using linear dynamic systems,” Pattern Recognit., vol. 46, no. 6, pp. 1710–1718, 2013.
[21] M. S. Ryoo and J. K. Aggarwal. (2010). An Overview of Contest on Semantic Description of Human Activities (SDHA) [Online]. Available: http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html
[22] M. Raptis and S. Soatto, “Tracklet descriptors for action modeling and video analysis,” in Proc. ECCV, 2010, pp. 577–590.
[23] J. Sun, X. Wu, S. Yan, L. F. Cheong, T. S. Chua, and J. Li, “Hierarchical spatio-temporal context modeling for action recognition,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009, pp. 2004–2011.
[24] M. Bregonzio, S. Gong, and T. Xiang, “Recognising action as clouds of space-time interest points,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009, pp. 1948–1955.
[25] S. F. Wong, T. K. Kim, and R. Cipolla, “Learning motion categories using both semantic and structural information,” in Proc. IEEE Int. Conf. CVPR, Jun. 2007, pp. 1–6.
[26] S. Savarese, A. Delpozo, J. Niebles, and L. Fei-Fei, “Spatial-temporal correlations for unsupervised action classification,” in Proc. IEEE Workshop Motion Video Comput., Jan. 2008, pp. 1–8.
[27] I. Kotsia, S. Zafeiriou, and I. Pitas, “Texture and shape information fusion for facial expression and facial action unit recognition,” Pattern Recognit., vol. 41, no. 3, pp. 833–851, 2008.
[28] H. Jiang, M. Drew, and Z. Li, “Successive convex matching for action detection,” in Proc. IEEE Int. Conf. CVPR, Jun. 2006, pp. 1646–1653.
[29] A. Klaser, M. Marszalek, and C. Schmid, “A spatio-temporal descriptor based on 3D-gradients,” in Proc. Brit. Mach. Vis. Conf., 2008, pp. 1–10.
[30] P. Scovanner, S. Ali, and M. Shah, “A 3-dimensional SIFT descriptor and its application to action recognition,” in Proc. ACM 15th Int. Conf. Multimedia, 2007, pp. 357–360.
[31] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in Proc. IEEE 17th ICPR, vol. 3, Aug. 2004, pp. 32–36.
[32] Z. Zhang, Y. Hu, S. Chan, and L. Chia, “Motion context: A new representation for human action recognition,” in Proc. ECCV, 2008, pp. 817–829.
[33] Y. Wang and G. Mori, “Human action recognition by semi-latent topic models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 10, pp. 1762–1774, Oct. 2009.
[34] J. Liu and M. Shah, “Learning human actions via information maximization,” in Proc. IEEE Int. Conf. CVPR, Jun. 2008, pp. 1–8.
[35] S. Ali, A. Basharat, and M. Shah, “Chaotic invariants for human action recognition,” in Proc. IEEE 11th ICCV, Oct. 2007, pp. 1–8.
[36] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, “A biologically inspired system for action recognition,” in Proc. IEEE 11th ICCV, Oct. 2007, pp. 1–8.
[37] K. Schindler and L. Van Gool, “Action snippets: How many frames does human action recognition require?” in Proc. IEEE Int. Conf. CVPR, Jun. 2008, pp. 1–8.
[38] T. Berg, A. Berg, and J. Shih, “Automatic attribute discovery and characterization from noisy web data,” in Proc. ECCV, 2010, pp. 663–676.
[39] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009, pp. 1778–1785.
[40] S. E. Palmer, “Hierarchical structure in perceptual representation,” Cognit. Psychol., vol. 9, no. 4, pp. 441–474, 1977.
[41] E. Wachsmuth, M. W. Oram, and D. I. Perrett, “Recognition of objects and their component parts: Responses of single units in the temporal cortex of the macaque,” Cereb. Cortex, vol. 4, no. 5, pp. 509–522, 1994.
[42] P. Paatero and U. Tapper, “Positive matrix factorization: A nonnegative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[43] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, pp. 788–791, Oct. 1999.
[44] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative matrix factorization for data representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[45] Y. Wang and G. Mori, “Max-margin hidden conditional random fields for human action recognition,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009, pp. 872–879.
[46] D. Ramanan and D. Forsyth, “Automatic annotation of everyday movements,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2003.
[47] J. Liu, Y. Yang, and M. Shah, “Learning semantic visual vocabularies using diffusion distance,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009, pp. 461–468.
[48] A. Fathi and G. Mori, “Action recognition by learning mid-level motion features,” in Proc. IEEE Int. Conf. CVPR, Jun. 2008, pp. 1–8.
[49] H. Wang, A. Klaser, C. Schmid, and C. Liu, “Dense trajectories and motion boundary descriptors for action recognition,” Int. J. Comput. Vis., vol. 103, no. 1, pp. 60–79, 2013.
[50] C. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009, pp. 951–958.
[51] J. Vogel and B. Schiele, “Semantic modeling of natural scenes for content-based image retrieval,” Int. J. Comput. Vis., vol. 72, no. 2, pp. 133–157, 2007.
[52] G. Wang, D. Hoiem, and D. Forsyth, “Learning image similarity from Flickr groups using stochastic intersection kernel machines,” in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 428–435.
[53] L. J. Li, C. Wang, Y. Lim, D. Blei, and F. Li, “Building and using a semantivisual image hierarchy,” in Proc. IEEE Int. Conf. CVPR, Jun. 2010, pp. 3336–3343.
[54] L. Torresani, M. Szummer, and A. Fitzgibbon, “Efficient object category recognition using classemes,” in Proc. Eur. Conf. Comput. Vis., 2010, pp. 776–789.
[55] A. Gilbert, J. Illingworth, and R. Bowden, “Fast realistic multi-action recognition using mined dense spatio-temporal features,” in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 925–931.
[56] A. Kovashka and K. Grauman, “Learning a hierarchy of discriminative space-time neighborhood features for human action recognition,” in Proc. IEEE Int. Conf. CVPR, Jun. 2010, pp. 2046–2053.
[57] C. Yuan, X. Li, W. Hu, H. Ling, and S. Maybank, “3D R transform on spatio-temporal interest points for action recognition,” in Proc. IEEE Conf. CVPR, Jun. 2013, pp. 724–730.
[58] C. Ding, D. Zhou, X. He, and H. Zha, “R1-PCA: Rotational invariant L1-norm principal component analysis for robust subspace factorization,” in Proc. 23rd ICML, 2006, pp. 281–288.
[59] A. Argyriou, T. Evgeniou, and M. Pontil, “Multi-task feature learning,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2007.
[60] G. Obozinski, B. Taskar, and M. Jordan, “Multi-task feature selection,” Dept. Statist., Univ. California, Berkeley, CA, USA, Tech. Rep., 2006.
[61] H. Huang and C. Ding, “Robust tensor factorization using R1 norm,” in Proc. IEEE Int. Conf. CVPR, Jun. 2008, pp. 1–8.
[62] F. R. K. Chung, Spectral Graph Theory. Providence, RI, USA: AMS, 1997.
[63] W. Brendel and S. Todorovic, “Activities as time series of human postures,” in Proc. ECCV, 2010, pp. 721–734.
[64] B. Li, M. Ayazoglu, T. Mao, O. Camps, and M. Sznaier, “Activity recognition using dynamic subspace angles,” in Proc. IEEE Int. Conf. CVPR, Jun. 2011, pp. 3193–3200.
[65] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, “Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 3361–3368.
[66] L. Yeffet and L. Wolf, “Local trinary patterns for human action recognition,” in Proc. IEEE 12th ICCV, Sep./Oct. 2009, pp. 492–497.
[67] A. Yao, J. Gall, and L. Van Gool, “A Hough transform-based voting framework for action recognition,” in Proc. IEEE Int. Conf. CVPR, Jun. 2010, pp. 2061–2068.
[68] Q. Qiu, Z. Jiang, and R. Chellappa, “Sparse dictionary-based representation and recognition of action attributes,” in Proc. IEEE ICCV, Nov. 2011, pp. 707–714.
[69] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised dictionary learning,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2008.
[70] D. M. Bradley and J. A. Bagnell, “Differential sparse coding,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2008.
[71] J. Liu, Y. Yang, I. Saleemi, and M. Shah, “Learning semantic features for action recognition via diffusion maps,” Comput. Vis. Image Understand., vol. 116, no. 3, pp. 361–377, 2012.
[72] N. Ikizler-Cinbis and S. Sclaroff, “Object, scene and actions: Combining multiple features for human action recognition,” in Proc. ECCV, 2010, pp. 494–507.
[73] M. Marszalek, I. Laptev, and C. Schmid, “Actions in context,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009, pp. 2929–2936.
[74] S. Sadanand and J. Corso, “Action bank: A high-level representation of activity in video,” in Proc. IEEE Int. Conf. CVPR, Jun. 2012, pp. 1234–1241.
[75] M. Raptis and L. Sigal, “Poselet key-framing: A model for human activity recognition,” in Proc. IEEE Int. Conf. CVPR, Jun. 2013, pp. 2650–2657.
[76] T. Lan, Y. Wang, and G. Mori, “Discriminative figure-centric models for joint action localization and recognition,” in Proc. IEEE ICCV, Nov. 2011, pp. 2003–2010.
[77] Y. Tian, R. Sukthankar, and M. Shah, “Spatiotemporal deformable part models for action detection,” in Proc. IEEE Int. Conf. CVPR, Jun. 2013, pp. 1–8.
[78] M. Ryoo, “Human activity prediction: Early recognition of ongoing activities from streaming videos,” in Proc. IEEE ICCV, Nov. 2011, pp. 1036–1043.
[79] Y. Zhang, X. Liu, M. Chang, X. Ge, and T. Chen, “Spatio-temporal phrases for activity recognition,” in Proc. ECCV, 2012, pp. 707–721.
[80] A. Vahdat, B. Gao, M. Ranjbar, and G. Mori, “A discriminative key pose sequence model for recognizing human interactions,” in Proc. IEEE ICCV Workshops, Nov. 2011, pp. 1729–1736.
[81] A. Gaidon, Z. Harchaoui, and C. Schmid, “Temporal localization of actions with actoms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2782–2795, Nov. 2013.
[82] M. Ullah, S. Parizi, and I. Laptev, “Improving bag-of-features action recognition with non-local cues,” in Proc. Brit. Mach. Vis. Conf., 2010, pp. 1–11.
[83] M. Raptis, I. Kokkinos, and S. Soatto, “Discovering discriminative action parts from mid-level video representations,” in Proc. IEEE Int. Conf. CVPR, Jun. 2012, pp. 1242–1249.

Haoran Wang received the B.S. degree from the Department of Information Science and Technology, Northeast University, Shenyang, China, in 2008. He is a Ph.D. student in the School of Automation, Southeast University, Nanjing, China. His research interests include action recognition, motion analysis, and event detection.

Chunfeng Yuan received the B.S. and M.S. degrees in computer science from the Qingdao University of Science and Technology, China, in 2004 and 2007, respectively, and the Ph.D. degree in computer science from the Institute of Automation (CASIA), Chinese Academy of Sciences, Beijing, China, in 2010. She was an Assistant Professor at CASIA. Her research interests and publications range from statistics to computer vision, including sparse representation, motion analysis, action recognition, and event detection.

Weiming Hu received the Ph.D. degree from the Department of Computer Science and Engineering, Zhejiang University, in 1998. From 1998 to 2000, he was a Post-Doctoral Research Fellow with the Institute of Computer Science and Technology, Peking University. Currently, he is a Professor with the Institute of Automation, Chinese Academy of Sciences. His research interests include visual surveillance and filtering of Internet objectionable information.
Haibin Ling received the B.S. degree in mathematics and the M.S. degree in computer science from Peking University, China, in 1997 and 2000, respectively, and the Ph.D. degree in computer science from the University of Maryland, College Park, in 2006. From 2000 to 2001, he was an Assistant Researcher at Microsoft Research Asia. From 2006 to 2007, he worked as a Post-Doctoral Scientist at the University of California, Los Angeles. He then joined Siemens Corporate Research as a Research Scientist. Since 2008, he has been an Assistant Professor at Temple University. His research interests include computer vision, medical image analysis, human-computer interaction, and machine learning. He received the Best Student Paper Award at the ACM Symposium on User Interface Software and Technology in 2003. He is currently an Area Chair for CVPR 2014.

Wankou Yang received the B.S., M.S., and Ph.D. degrees from the School of Computer Science and Technology, Nanjing University of Science and Technology, China, in 2002, 2004, and 2009, respectively. He is currently an Assistant Professor with the School of Automation, Southeast University. His research interests include pattern recognition, computer vision, and machine learning.

Changyin Sun is a Professor in the School of Automation, Southeast University, China. He received the M.S. and Ph.D. degrees in electrical engineering from Southeast University, Nanjing, China, in 2001 and 2003, respectively. His research interests include intelligent control, neural networks, SVM, pattern recognition, and optimization theory. He received the First Prize in Natural Science from the Ministry of Education, China. He is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS, Neural Processing Letters, the International Journal of Swarm Intelligence Research, and Recent Patents on Computer Science.