Fig. 2. Flowchart of the proposed framework.
human action representation. Typically, from an input human action video, hundreds of interest points are first extracted and then agglomerated into tens of action units, which compactly represent the video. Such a representation is more discriminative than the traditional BoVW model. To use it for action recognition, we address the following three main issues.
1. Selecting low-level features for generating the action
unit. Some of the aforementioned features require reliable
tracking or body pose estimation, which is hard to achieve
in practice. The interest-point-based representation circumvents such requirements while being robust to noise, occlusion, and geometric variation. However, traditional bag-of-visual-words (BoVW) models utilize only features from individual interest points and ignore spatial-temporal context information. To
address this issue, we propose a new context-aware descriptor
that incorporates context information from neighboring interest
points. This way, the new descriptor is more discriminative and
robust than the traditional BoVW.
2. Building an action unit set to represent all action classes
under investigation. Nonnegative Matrix Factorization
(NMF) [43] has received considerable attention and has been shown to capture part-based representations in the human brain [40], [41] as well as in vision tasks [42], [43]. We propose using graph regularized Nonnegative Matrix Factorization (GNMF) [44], which encodes the geometrical information by constructing a nearest neighbor graph connecting data points that are sufficiently close to each other, and finds a part-based representation that respects this structure. The
GNMF-based action units are automatically learned from
the training data and are capable of capturing the intra-class
variation of each action class.
3. Choosing discriminative action units and suppressing
noise in action classes. We propose a new action unit selection
method based on a joint l2,1-norm minimization. We first introduce the l2,1-norm for vectors. A sparse model based on this norm is robust to outliers, and the regularization guides the selection of action units shared across intra-class samples. The dictionary learning process captures the fact that actions from the same class share similar action units.
In this work we target learning high-level action units to
represent and classify human actions. For this purpose, we
improve over the traditional interest point feature and propose
an action unit based solution, which is further improved by an
action unit selection procedure. Fig. 2 illustrates the flowchart
of our framework. In summary, the training phase learns the
model for action units and the action classifier on the action
unit-based representation. The testing phase uses the learned
model for action prediction.
In the rest of the paper, Sec. 2 reviews the related work.
Sec. 3 introduces the new context-aware descriptor as the low-
level feature. Sec. 4 presents the GNMF-based action unit
as the high-level feature. Sec. 5 proposes the joint l2,1-norm
based action unit selection and a supervised dictionary learning
method for classification. Sec. 6 demonstrates the experimental
results. Sec. 7 concludes this paper.
II. RELATED WORK
Action recognition has been widely explored in the com-
puter vision community. Recently, some attempts have been
made to use the mid- or high-level concepts for human
action recognition. Liu et al. [34] exploit mutual informa-
tion maximization techniques to learn a compact mid-level
codebook and use the spatial-temporal pyramid to exploit
temporal information. Fathi et al. [48] extract discriminative
flow features within overlapping space-time cells and select
mid-level features via AdaBoost. Unfortunately, the global
binning makes the representation sensitive to position or
time shifts in the clip segmentation, and using predetermined
fixed-size spatial-temporal grid bins assumes that the proper
volume scale is known and uniform across action classes. Such
uniformity is not inherent in the features themselves, given
the large differences between the spatial-temporal distributions
of the features for different activities. Wang et al. [45] use
the hidden conditional random fields for action recognition.
The authors model an action class as a root template and a
constellation of hidden “parts”, where the hidden “part” is
a group of local patches that are implicitly correlated with
some intermediate representation. Ramanan et al. [46] first
track the persons with 3D pose estimation, which is then
used for action recognition. Liu et al. [47] use diffusion
maps to automatically learn a semantic visual vocabulary from
abundant quantized mid-level features, each represented by the
vector of pointwise mutual information. However, the vocabularies are created for individual categories, so they are not sufficiently universal and general, which limits their applicability. Liu
et al. [6] learn data-driven attributes as the latent variables.
The authors augment the manually-specified attributes with the
automatically learned attributes to provide a complete charac-
terization of human actions. Compared with traditional low-
level features [13], [55]–[57], it is obvious that human actions
are more effectively represented by considering multiple high-
level semantic concepts. However, the current learning-based
visual representations obtain labels from the entire video, and hence include background clutter, which may degrade learning effectiveness.
High-level concepts have also been applied to object recog-
nition. Farhadi et al. [39] use a set of semantic attributes such
as ‘hairy’ and ‘four-legged’ to identify familiar objects, and
to describe unfamiliar objects when images and bounding box
annotations are provided. Lampert et al. [50] show that high-
level descriptors in terms of semantic attributes can be used
to recognize object classes without any training image, once
semantic attribute classifiers are trained from other classes of
data. Vogel et al. [51] use attributes related to the scenes to
characterize image regions and combine these local semantic
attributes to form a global image description for natural scene
retrieval. Wang et al. [52] propose to represent an image by
its similarities to Flickr image groups which have explicit
semantic meanings. Classifiers are learned to predict the
membership of images to Flickr groups, and the class mem-
bership probabilities are used to define the image similarity.
Li et al. [53] build a semantically meaningful image hierarchy
by using both visual and semantic information, and represent
images by the estimated distributions of concepts over the
entire hierarchy. Torresani et al. [54] use the outputs of a large
number of object category classifiers to represent images.
Dictionary learning has proven effective for selecting discriminative low-level features for classification.
Liu et al. [3] use PageRank to mine the most informa-
tive static features. In order to further construct compact
yet discriminative visual vocabularies, a divisive information-
theoretic algorithm is employed to group semantically related
features. Brendel et al. [63] store multiple diverse exemplars
per activity class, and learn a sparse dictionary of most
discriminative exemplars. Qiu et al. [68] propose a Gaussian
process model to optimize the dictionary objective function.
The dictionary learning algorithm is based on sparse rep-
resentation which has recently received a lot of attention.
Mairal et al. [69] generalize the reconstructive sparse dictio-
nary learning process by optimizing the sparse reconstruction
jointly with a linear prediction model. Bradley and Bagnell
[70] propose a novel differentiable sparse prior rather than
the conventional L1 norm, and employ a back propagation
procedure to train the dictionary for sparse coding in order
to minimize the training error. These approaches need to
explicitly associate each sample with a label in order to
perform the supervised training. They aim at learning dis-
criminative sparse models instead of purely reconstructive
ones.
III. LOCALLY WEIGHTED WORD CONTEXT DESCRIPTOR
We propose a new context-aware descriptor called locally
weighted word context (LWWC) as the low-level descriptor.
LWWC encodes spatial context information rather than being limited to a single interest point as in traditional interest-point-based descriptors. Such spatial context information is extracted from neighboring interest points and improves the robustness and discriminability of the proposed descriptor.
We first perform space-time interest point detection using the methods in [12] and [13] for different data sets. These
interest points are initially described by histograms-of-optical-
flow (HOF) and histograms-of-oriented-gradients (HOG),
which respectively characterize the motion and appearance
within a volume surrounding an interest point. Afterwards,
we employ the k-means algorithm on these features to create
a vocabulary of size K. Following the BoVW, each interest
point is then converted to a visual word.
Fig. 3. Illustration of the locally weighted word context descriptor. (a) shows the structure of this descriptor. LWWC is constructed by several neighboring interest points together; $D_\sigma(p, q_j)$ is the distance between the central interest point and a neighboring interest point. (b) shows the representation for the central interest point by a K-by-1 vector applied to each local feature.
Finally, for each interest point together with its N − 1 nearest interest points, the locally weighted word context (LWWC) descriptor is calculated as follows.
Let N(p) = {p, q1, · · · , qN−1} denote an interest point p
and its N − 1 nearest neighboring points. The N − 1 nearest
neighboring points are collected according to the normal-
ized Euclidean distance on their 3D position spatial-temporal
coordinates:
$D_\sigma(p, q) = \left( \sum_{i=1,2,3} \frac{1}{\sigma_i}\,\big(p(i) - q(i)\big)^2 \right)^{1/2}$,   (1)
where the three components in p and q record the horizontal,
vertical and temporal positions of the interest points respec-
tively; and σi is the corresponding weight.
Using the BoVW model, each point in N(p) is assigned to a visual word. Therefore, the LWWC of N(p) is defined as a vector of size K × 1, where K is the size of the vocabulary. The k-th element of the LWWC vector is inversely proportional to the distance between p and the corresponding points in N(p). If multiple points in N(p) belong to the same visual word, their responses are summed together.
Specifically, we set the value corresponding to p to 1, and $\beta / D_\sigma(p, q_j)$ for $q_j \in N(p)$. As shown in Fig. 3, the LWWC for N(p) is denoted as follows:

$F_p = [h_1, h_2, \cdots, h_K]^T$,   (2)

$h_i = \begin{cases} 1 & \text{if } \mathrm{label}(p) = i \\ \sum_{j=1}^{N-1} \beta \cdot \dfrac{\delta(\mathrm{label}(q_j) - i)}{D_\sigma(p, q_j)} & \text{otherwise}, \end{cases}$   (3)
where label(p) denotes the word id of point p. By rebuilding the BoVW model on the LWWC descriptors, each action video is represented by a vector of low-level features.
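The computation above can be summarized in a short sketch. The following is a minimal Python illustration of Eqs. (1)-(3), assuming the interest points are given as 3D spatial-temporal positions with visual-word labels already obtained from k-means on the HOG/HOF features; the function name and default parameters are ours for illustration, not part of the original implementation.

```python
import numpy as np

def lwwc_descriptors(pos, labels, K, n_neighbors=8, beta=1.0, sigma=(1.0, 1.0, 1.0)):
    """pos: (n, 3) array of (x, y, t) interest-point positions.
    labels: (n,) array of visual-word ids in [0, K).
    Returns an (n, K) array with one LWWC descriptor per interest point."""
    pos = np.asarray(pos, dtype=float)
    labels = np.asarray(labels)
    sigma = np.asarray(sigma, dtype=float)
    n = len(pos)
    F = np.zeros((n, K))
    for p in range(n):
        # normalized Euclidean distance of Eq. (1) to every other point
        d = np.sqrt((((pos - pos[p]) ** 2) / sigma).sum(axis=1))
        d[p] = np.inf                              # exclude the point itself
        neighbors = np.argsort(d)[:n_neighbors]    # the N-1 nearest neighbors
        h = np.zeros(K)
        for q in neighbors:                        # Eq. (3): inverse-distance votes
            if labels[q] != labels[p]:
                h[labels[q]] += beta / d[q]
        h[labels[p]] = 1.0                         # the central word gets weight 1
        F[p] = h
    return F
```

Rebuilding the BoVW model then amounts to quantizing these K-dimensional descriptors with a second vocabulary and histogramming them per video.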
IV. GNMF-BASED ACTION UNITS
NMF is a factorization algorithm for analyzing nonnegative
matrices. The nonnegative constraints allow only additive
combinations of different bases. This is the most significant difference between NMF and other matrix factorization methods such as singular value decomposition (SVD).
Fig. 4. GNMF-based action units. (a) A video from class “walking” includes two action units “translation of torso” and “leg motion”. (b) A video from class “waving” includes two identical action units “arm motion”.
NMF
can learn a part-based representation. But it fails to discover
the intrinsic geometrical and discriminative structure of the
data space, which is essential in real-world applications. If two data points are close in the intrinsic geometry of the data distribution, the representations of these two points with respect to the new bases should still be close to each other.
GNMF [44] aims to solve this problem. Most previous works
represent actions with low-level features. In this paper, we
propose to extract high-level action units based on GNMF to
better describe the human actions. The GNMF-based action
units are generated as follows.
Let $y_i^j \in \mathbb{R}^d$, $i = 1, \cdots, C$, $j = 1, \cdots, n_i$ denote the d-dimensional low-level feature representation of the j-th video in class i. All such representations in class i form a matrix $Y_i = [y_i^1, \cdots, y_i^{n_i}] \in \mathbb{R}^{d \times n_i}$. GNMF minimizes the following objective function:

$Q = \| Y_i - U V^T \|_F^2 + \lambda \,\mathrm{Trace}(V^T L V)$,   (4)
where $U \in \mathbb{R}^{d \times s_i}$ and $V \in \mathbb{R}^{n_i \times s_i}$ are two nonnegative matrices, $L = D - W$ is the graph Laplacian [62], and W is the symmetric nonnegative similarity matrix. We adopt the heat kernel weight $W_{jl} = e^{-\frac{1}{\delta}\| y_i^j - y_i^l \|^2}$. D is a diagonal matrix
whose entries are column (or row, since W is symmetric)
sums of W. Each column of matrix V T is a low-dimensional
representation of the corresponding column of Yi with respect
to the new bases. We define the column vectors of matrix U
as the action units belonging to action class i. Each element
of the column vector corresponds to a visual word which is
obtained from the low-level descriptors. Each column vector
of matrix U is a semantic representation constructed by several
correlated visual words. Repeating the same process for each
action class, we obtain all the class-specific action units.
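As a rough illustration of this step, the sketch below learns class-specific action units with the standard multiplicative update rules of GNMF [44] on a k-nearest-neighbor graph with heat-kernel weights. It is a simplified version under stated assumptions (dense graph construction, fixed iteration count, nonnegative input Y such as BoVW histograms), not the authors' implementation.

```python
import numpy as np

def gnmf_action_units(Y, n_units, lam=1.0, n_neighbors=5, delta=1.0, n_iter=200, seed=0):
    """Y: (d, n) nonnegative matrix of low-level video representations for one class.
    Returns U (d, n_units), whose columns are the class-specific action units,
    and V (n, n_units), whose rows are the per-video activations."""
    d, n = Y.shape
    # nearest-neighbor graph with heat-kernel weights W_jl = exp(-||y_j - y_l||^2 / delta)
    dist2 = ((Y.T[:, None, :] - Y.T[None, :, :]) ** 2).sum(axis=2)
    W = np.zeros((n, n))
    for j in range(n):
        nn = np.argsort(dist2[j])[1:n_neighbors + 1]
        W[j, nn] = np.exp(-dist2[j, nn] / delta)
    W = np.maximum(W, W.T)                    # symmetrize the graph
    D = np.diag(W.sum(axis=1))                # degree matrix; L = D - W
    rng = np.random.default_rng(seed)
    U = rng.random((d, n_units))
    V = rng.random((n, n_units))
    eps = 1e-10
    for _ in range(n_iter):                   # multiplicative updates of GNMF [44]
        U *= (Y @ V) / (U @ V.T @ V + eps)
        V *= (Y.T @ U + lam * W @ V) / (V @ U.T @ U + lam * D @ V + eps)
    return U, V
```

Running this once per class and concatenating the resulting U matrices yields the full set of class-specific action units.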
As an example, suppose we have four action units forming
the bases: “translation of torso”, “up-down torso motion”,
“arm motion”, and “leg motion”. Then the action class
“walking” may be represented by an action-unit-based vec-
tor [1, 0, 1, 1] ∈ R4, and “waving” may be represented
by vector [0, 0, 2, 0] ∈ R4, with each dimension indicat-
ing the degree of the corresponding action unit as shown
in Fig. 4.
The action-unit-based representation has two main advantages. First, it is compact, since only tens of action units are needed to describe an action video. This is more efficient than BoVW models, where hundreds of interest points are needed.
Second, some low-level local features are not discriminative and may even have a negative influence on classification. The process of learning class-specific action units can suppress such noise.
The matrix factorization algorithm extracts the representative
action units for each action class. The representative action
units should exist in all the videos belonging to the same
action class. Some low-level local features that only exist
in a few intra-class videos are suppressed by the algorithm
mentioned above, and are not used for constructing the high-
level action units. The learned class-specific action units exhibit the characteristics of each action class, so the proposed action-unit-based representation is more powerful for classification.
V. ROBUST ACTION UNIT SELECTION BASED ON
JOINT l2,1-NORMS
Recently sparse representation has received a lot of attention
in computer vision. Typically, it approximates the input signal as a sparse linear combination of given over-complete bases in a dictionary. Such sparse representations are usually obtained by solving an l1-norm minimization problem via linear programming. However, the l1-norm based regularization
is sensitive to outliers. Inspired by the l2,1-norm of a matrix
[58], we first introduce the l2,1-norm of a vector. Moreover, we
propose a new joint l2,1-norm based sparse model to select the
representative action units for each action class. The proposed
sparse model has two main advantages for classification-based action unit selection. First, the l2,1-norm of the matrix in our sparse model encourages samples from the same action class to be constructed from similar action units, and suppresses action units that appear in only a few intra-class samples. Second, each action class has its own representative action units. The l2,1-norm of the vector in our sparse model encourages each sample to be constructed from action units of the same class.
A. Notations and Definitions
We first introduce the notation and the definitions of the norms. For a matrix Z, its j-th row and k-th column are denoted by $Z_{j\cdot}$ and $Z_{\cdot k}$, respectively, and $z_{jk}$ is the element in the j-th row and k-th column. The l2,1-norm of a matrix was introduced in [58] as a rotational invariant l1-norm and has also been used for multi-task learning [59], [60] and tensor factorization [61]. It is defined as:
$\| Z \|_{2,1} = \sum_{j=1}^{n} \| Z_{j\cdot} \|_2 = \sum_{j=1}^{n} \left( \sum_{k=1}^{r} z_{jk}^2 \right)^{1/2}$.   (5)
We first introduce the l2,1-norm of a vector. For a vector $b = [b_1, b_2, \cdots, b_n]^T$, the elements are divided into G groups according to some rule, and the number of elements in group g is $m_g$: $b = [b_{11}, \cdots, b_{1m_1}, \cdots, b_{g1}, \cdots, b_{gm_g}, \cdots, b_{G1}, \cdots, b_{Gm_G}]^T$. The l2,1-norm of the vector b is defined as:

$\| b \|_{2,1} = \sum_{g=1}^{G} \left( \sum_{k=1}^{m_g} b_{gk}^2 \right)^{1/2}$.   (6)
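For concreteness, the two norms can be computed with small helper functions (our own, illustrative definitions using zero-based arrays and group sizes m_1, ..., m_G):

```python
import numpy as np

def matrix_l21(Z):
    """l2,1-norm of a matrix, Eq. (5): the sum of the l2-norms of its rows."""
    Z = np.asarray(Z, dtype=float)
    return np.sqrt((Z ** 2).sum(axis=1)).sum()

def vector_l21(b, group_sizes):
    """Grouped l2,1-norm of a vector, Eq. (6): the sum of the l2-norms of its groups."""
    b = np.asarray(b, dtype=float)
    total, start = 0.0, 0
    for m in group_sizes:
        total += np.linalg.norm(b[start:start + m])
        start += m
    return total
```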
B. Problem Formulation
Following the notation in Sec. IV, assume that the i-th action class has $m_i$ learned action units, with $\sum_{i=1}^{C} m_i = m$. We initialize the dictionary $B = [B_1, B_2, \ldots, B_C]$ such that $B_i = [b_{i1}, \ldots, b_{im_i}]$, where $b_{ij}$ denotes the j-th action unit of the i-th class. The proposed sparse model for action unit selection is
$\min_{B, X^i} \sum_{i=1}^{C} \left( \| Y_i - B X^i \|_F^2 + \gamma_1 \sum_{k} \| X^i_{\cdot k} \|_{2,1} + \gamma_2 \| X^i \|_{2,1} \right)$,   (7)
where $\| \cdot \|_F$ denotes the Frobenius norm and $\| \cdot \|_{2,1}$ the l2,1-norm. The term $\| X^i_{\cdot k} \|_{2,1}$ penalizes each group of elements in the vector $X^i_{\cdot k}$ (each group corresponding to the columns of a sub-dictionary $B_i$) as a whole and enforces sparsity among the groups, so it encourages each action video to be constructed from action units of the same class. The term $\| X^i \|_{2,1}$ penalizes each row of the matrix $X^i$ as a whole and enforces sparsity among the rows, so it encourages videos from the same action class to be constructed from similar action units. $\gamma_1$ and $\gamma_2$ are the regularization parameters.
For classification, with a sparse model $\phi(y_t, B)$, a predictive model $f(x_t) = f(\phi(y_t, B))$, a class label $l_t$ of the action video $y_t$, and a classification loss $L(l_t, f(x_t)) = \| l_t - f(x_t) \|_2^2$, we desire to train the whole system with respect to the dictionary B given P training samples:

$\min_B E = \min_B \sum_{t=1}^{P} L(l_t, f(\phi(y_t, B)))$.   (8)
The dictionary optimization is carried out using an iterative
approach composed of two steps: the sparse coding step on a
fixed B and the dictionary update step on fixed Xi .
Step 1: Sparse coding. Taking the derivative of (7) with respect to $X^i_{\cdot k}$ ($1 \le k \le n_i$) and setting it to zero, we have

$2 B^T B X^i_{\cdot k} - 2 B^T (Y_i)_{\cdot k} + \gamma_1 D_k X^i_{\cdot k} + \gamma_2 E X^i_{\cdot k} = 0$,   (9)
where

$E = \mathrm{diag}\!\left( \frac{1}{\| X^i_{1\cdot} \|_2}, \frac{1}{\| X^i_{2\cdot} \|_2}, \cdots, \frac{1}{\| X^i_{m\cdot} \|_2} \right)$,   (10)

$D_k = \mathrm{diag}\big( w_1 I(m_1), w_2 I(m_2), \cdots, w_C I(m_C) \big)$,   (11)

where $w_j = \left( \sum_{p=1+M_j-m_j}^{M_j} (X^i_{p,k})^2 \right)^{-1/2}$, $M_j = \sum_{l=1}^{j} m_l$, $I(m_j)$ is the $m_j \times m_j$ identity matrix, and $\mathrm{diag}(\cdot)$ denotes a diagonal matrix formed from the elements of the vector.
From Eq. (9), we get

$X^i_{\cdot k} = 2\,(2 B^T B + \gamma_1 D_k + \gamma_2 E)^{-1} B^T (Y_i)_{\cdot k}$.   (12)
Note that Dk and E depend on Xi and thus are also unknown
variables. An iterative algorithm is proposed to solve this
problem, which is illustrated in Algorithm 1.
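A minimal sketch of this iterative sparse coding step (Eqs. (9)-(12), corresponding to Algorithm 1) is given below, assuming the dictionary columns are grouped by class with sizes [m_1, ..., m_C]; the helper name, initialization, and fixed iteration count are ours, and a small epsilon guards the inverse norms.

```python
import numpy as np

def sparse_coding(B, Y, class_sizes, gamma1=0.2, gamma2=0.2, n_iter=20):
    """B: (d, m) dictionary of action units, Y: (d, n) matrix of class-i videos.
    Returns X: (m, n) sparse codes, one column per video."""
    d, m = B.shape
    n = Y.shape[1]
    X = np.ones((m, n))                                  # initial codes
    bounds = np.cumsum([0] + list(class_sizes))          # group boundaries M_j
    eps = 1e-10
    for _ in range(n_iter):
        # E from Eq. (10): inverse l2-norms of the rows of the current X
        E = np.diag(1.0 / (np.linalg.norm(X, axis=1) + eps))
        X_new = np.zeros_like(X)
        for k in range(n):
            # D_k from Eq. (11): block-diagonal inverse group norms of column k
            w = np.concatenate([
                np.full(class_sizes[j],
                        1.0 / (np.linalg.norm(X[bounds[j]:bounds[j + 1], k]) + eps))
                for j in range(len(class_sizes))])
            Dk = np.diag(w)
            # closed-form update of Eq. (12)
            X_new[:, k] = 2 * np.linalg.solve(
                2 * B.T @ B + gamma1 * Dk + gamma2 * E, B.T @ Y[:, k])
        X = X_new
    return X
```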
Step 2: Dictionary updating. Minimizing the loss func-
tion E over B will tighten the learned dictionary with the
classification model, and therefore improve the classification
effectiveness. We compute the gradient of E with respect to B
according to Eq. (8):

$\nabla_B E = \sum_{t=1}^{P} \nabla_B L = \sum_{t=1}^{P} \nabla_f L \cdot \nabla_B f = \sum_{t=1}^{P} \nabla_f L \cdot \nabla_{x_t} f \cdot \nabla_B x_t$.   (13)
Therefore, the problem reduces to computing the gradient of the sparse representation $x_t$ with respect to the dictionary B. According to Eq. (12), we have

$\nabla_B X^i_{\cdot k} = (2 B^T B + \gamma_1 D_k + \gamma_2 E)^{-1} \cdot \left[ 2 \nabla_B\big(B^T (Y_i)_{\cdot k}\big) - \big( 2 \nabla_B(B^T B) + \gamma_1 \nabla_B D_k + \gamma_2 \nabla_B E \big) X^i_{\cdot k} \right]$.   (14)
Through the above process, we obtain the optimized
dictionary.
For a test sample y, we get the action-unit-based representation x by

$\min_x \; \| y - Bx \|_F^2 + \gamma_1 \| x \|_{2,1}$.   (15)
SVM is adopted as the predictive model for discriminative dictionary learning and classification, and we employ the generalized Gaussian kernel with the χ² distance, i.e.,

$K(H_i, H_j) = \exp\!\left( -\frac{1}{A}\, \chi^2(H_i, H_j) \right)$,   (16)
for two histograms Hi, Hj, where A is the scale parameter set
as the mean distance between training samples.
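A small illustrative helper (ours, not the authors' code) for the kernel of Eq. (16) between two sets of histogram representations:

```python
import numpy as np

def chi2_kernel(H1, H2, A=None, eps=1e-10):
    """H1: (n1, K), H2: (n2, K) nonnegative histograms; returns an (n1, n2) kernel matrix."""
    H1, H2 = np.asarray(H1, float), np.asarray(H2, float)
    diff2 = (H1[:, None, :] - H2[None, :, :]) ** 2
    summ = H1[:, None, :] + H2[None, :, :] + eps
    chi2 = (diff2 / summ).sum(axis=2)              # chi^2 distance matrix
    if A is None:
        A = chi2.mean()                            # mean distance, as in the paper
    return np.exp(-chi2 / A)
```

At test time, A should be kept fixed to the mean distance computed on the training samples.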
VI. EXPERIMENTAL RESULTS
A. Data Sets
Five action data sets are used in our evaluation: the
KTH action data set [31], the UCF Sports data set [10], the
UT-Interaction data set [21], the UCF YouTube data set [3],
and the Hollywood2 data set [73]. Examples of these data sets
are shown in Fig. 5.
The KTH data set contains six single person actions
(boxing, hand waving, hand clapping, walking, jogging, and
running) performed repeatedly by 25 persons in four different
scenarios: outdoors, outdoors with camera zoom, outdoors
with different clothes, and indoors.
The UCF Sports data set is a challenging collection of action clips from various broadcast sports videos. Actions include diving, golf swinging, kicking, lifting, horseback riding, running, skating, swinging, and walking. The actions are captured in a wide range of scenes and viewpoints.
Fig. 5. Representative frames from videos in five data sets. From top to
bottom: the KTH data set, the UCF Sports data set, the UT-Interaction data
set, the UCF YouTube data set, and the Hollywood2 data set.
The UT-Interaction data set has been used in the first
Contest on Semantic Description of Human Activities [21].
This data set contains action sequences of six interactions:
hug, kick, point, punch, push, and hand-shake. For classifica-
tion, 120 video segments cropped based on the ground-truth
bounding boxes and time intervals are provided by the data set
organizers. These segments are further divided into two sets,
and each set has 60 segments with 10 segments per class. Set 1
is captured at a parking lot and set 2 at a lawn.
The UCF YouTube data set contains 11 action categories:
basketball shooting, biking/cycling, diving, golf swinging,
horse back riding, soccer juggling, swinging, tennis swinging,
trampoline jumping, volleyball spiking, and walking with a
dog. This data set is challenging due to large variations in
camera motion, object appearance and pose, object scale,
viewpoint, cluttered background and illumination conditions.
The Hollywood2 data set has been collected from 69 differ-
ent Hollywood movies. There are 12 action classes: answering
the phone, driving a car, eating, fighting, getting out of a car, hand
shaking, hugging, kissing, running, sitting down, sitting up,
and standing up. In our experiments, we use the clean training
data set. In total, there are 1707 action samples divided into a
training set and a test set. Training and test sequences come from different movies.
For the KTH and UT-Interaction data sets, the Harris3D
detector [13] is used for interest point extraction, and the
cuboid detector [12] is adopted for the other data sets.
B. Effects of the LWWC Descriptor
In our approach, we adopt the LWWC descriptor at low
level. Different from traditional interest-point-based methods,
the proposed descriptor contains the information of an area,
rather than a single interest point, by describing the distribution
of neighboring interest points.
Experiments are conducted to evaluate the influence of the
neighborhood information in the LWWC descriptor. Fig. 6
illustrates the recognition rates corresponding to different
scales of neighborhood information covered by the proposed
descriptor on the KTH and the challenging UT-Interaction
data sets. Traditional interest-point-based methods utilize only the features of a single interest point, so they describe a very small area and the accuracy is easily influenced by noise. With such features, the recognition rate is 93.99% on the KTH data set.
Fig. 6. Performance of LWWC on the KTH and UT-Interaction data sets.
Fig. 7. Performance of action unit selection on KTH and UT-Interaction.
When the features of neighboring interest points are involved, the
proposed descriptor describes a larger area, and makes use
of more neighborhood information. So, the recognition rate is
raised to 95.49% when 8 nearest neighboring interest points
are collected for the LWWC descriptor. To further validate
the effectiveness of our descriptor, we also conduct similar
experiments on the UT-Interaction data set. In set 1, when we
adopt the feature of a single interest point, the accuracy is
only 78.3%. The accuracy is raised to 81.7% when 4 nearest
neighboring interest points are collected. In set 2, the accuracy
is raised to 80.0% from 68.3% when 8 nearest neighboring
interest points are collected. These experimental results show
that the recognition rate is improved when the neighborhood
information is incorporated into the LWWC descriptor.
C. Analysis of the Action Unit Selection
Based on low-level descriptors, the action units are learned
through GNMF. Among the learned action units, our proposed
joint l2,1-norm based sparse model aims at selecting the class-
specific representative action units to improve the recognition
performance. It encourages actions from the same class to be described by the same action units, and each action to be described by action units from a single class.
To evaluate action unit selection, we compare the perfor-
mances of the original GNMF-based action units with the one
using action unit selection, as illustrated in Fig. 7. On the
KTH data set, the action unit selection significantly boosts the
performance from 92.65% to 95.49%. On the UT-Interaction
data set, it again significantly boosts the recognition accuracies
from 80.0% to 81.7% (on Set 1) and from 70.0% to 80.0%
(on Set 2). These results clearly validate that the proposed
joint l2,1-norm based action unit selection method is effective
to improve the recognition performance.
Fig. 8. Confusion matrix of the classification on the KTH data set.
TABLE I
COMPARISON WITH PREVIOUS METHODS ON THE
KTH DATA SET
D. Experiments on the KTH Data Set
Consistent with the experiment setting used in
[13], [31], [34], we test the proposed approach on the
entire KTH data set [31], in which videos of four scenarios
are mixed together. We split the data set into a training part
with 24 persons’ videos and a test part with the remaining
videos. The final result is the average over 25 runs.
For the sparse model based action unit selection, we set the
tradeoff parameters γ1 = γ2 = 0.2.
Fig. 8 presents the confusion matrix of the classification on
the KTH data set. It shows that our approach works excellently
on most actions such as “hand waving” and “boxing”. The
main confusion occurs between “jogging” and “running”,
since the actions performed by some actors are very similar.
Table I lists the average accuracies of our method and other
recently proposed ones. It shows that our method achieves
excellent performance (95.49%), which is comparable to the
best reported results.
The experimental results validate the effectiveness of the
proposed method. Furthermore, we compare the performances
of different baseline approaches (such as traditional single
interest point feature, the proposed LWWC descriptor, l1-norm
based sparse model, and our action unit selection approach),
and study the contribution of each part in our method, as
illustrated in Table II. The accuracy of traditional single
TABLE II
CONTRIBUTIONS OF THE PROPOSED DIFFERENT APPROACHES TO
CLASSIFICATION ON THE KTH DATA SET
TABLE III
CLASSIFICATION ACCURACIES ON THE KTH DATA SET
interest point feature is 91.80%. If we only use the LWWC
descriptor, the accuracy is 92.65%. When we only adopt the
action unit selection based on the traditional single interest
point feature, the accuracy is 93.99%. Combining both, the
accuracy reaches 95.49%. When we use the l1-norm based sparse model on LWWC descriptors, the accuracy is 92.82%. The study
demonstrates that each of the proposed approaches offers
more discriminative power than the BoVW baseline, and the
l2,1-norm based action unit selection approach obtains better
performance than the l1-norm based sparse model. It further
validates the effectiveness of the high-level descriptor for
classification. Our method, which combines the low-level
LWWC descriptor with the high-level action unit selection,
achieves the best performance. Furthermore, our method is
compared with some basic cases, as shown in Table III. The
comparisons validate that the proposed method improves the
performance of traditional methods.
E. Experiments on the UCF Sports Data Set
Most previously reported results on this data set use a leave-one-out (LOO) protocol, cycling through each example as the test video one at a time. However, Lan et al. [76] propose to split the data set into disjoint training and testing sets for evaluation, to avoid exploiting background regularity. We report our results in
both protocols. For the realistic data set, we perform dense and
multi-scale interest point extraction. To generate the codebook,
we empirically set the codebook size k to 1000. For the
sparse model based action unit selection, we set the tradeoff
parameters γ1 = γ2 = 0.2.
Fig. 9 presents the confusion matrix across all scenarios
in the leave-one-out protocol on the UCF Sports data set.
Our method works well on most actions. For example, the recognition accuracies for some actions, such as “diving” and “lifting”, reach 100%. There are complex backgrounds
in this data set, and some actions are very similar and
challenging for recognition, such as “golfing”, “horseback
riding”, and “running”. We conduct further experiments on
Fig. 9. Confusion matrix on the UCF Sports data set (LOO).
TABLE IV
CONTRIBUTIONS OF THE PROPOSED DIFFERENT APPROACHES TO
CLASSIFICATION ON THE UCF SPORTS DATA SET (SPLIT)
TABLE V
COMPARISON WITH PREVIOUS METHODS ON THE
UCF SPORTS DATA SET
the UCF Sports data set to study the different components of the proposed approach, similarly to the KTH experiments. Table IV illustrates the comparison of the two proposed approaches and their combination in Lan's protocol (Split). The effectiveness of each proposed ingredient is again confirmed, as in the results on the KTH data set. Table V compares the overall
mean accuracy of our method with the results reported by
previous researchers. Our average recognition accuracy is
competitive with most reported results except the action bank
method [74].
F. Experiments on the UT-Interaction Data Set
The action videos in UT-Interaction are divided into two
sets. To generate the codebook, we empirically set the code-
book size k to 500 in set 1, and set the codebook size
Fig. 10. Confusion matrices on the UT-Interaction data set. (a) Set 1.
(b) Set 2.
TABLE VI
CONTRIBUTIONS OF THE PROPOSED DIFFERENT APPROACHES TO
CLASSIFICATION ON THE UT-INTERACTION DATA SET
TABLE VII
COMPARISON WITH PREVIOUS METHODS ON THE UT-INTERACTION
DATA SET. THE THIRD AND FOURTH COLUMNS REPORT THE
RECOGNITION RATE USING THE FIRST HALF
AND THE ENTIRE VIDEOS RESPECTIVELY
k to 300 in set 2. We set the tradeoff parameters γ1 =
γ2 = 0.2 for both sets. We perform the leave-one-out test
strategy.
Fig. 10 presents the confusion matrices across all scenarios
in set 1 and set 2. The recognition accuracies for some actions
are excellent, such as “point” in set 2 and “push”. Table VI
illustrates the comparison of the two proposed approaches with
their combination in both set 1 and set 2. Each approach offers
more discriminative power than traditional single interest
point feature. The LWWC descriptor performs better than the
action unit selection method in both sets, and the combina-
tion of them provides the best performance. The l2,1-norm based action unit selection method outperforms the l1-norm based
sparse model. Also, the result of our method is compared
with those reported by previous researchers in Table VII,
and some basic cases in Table VIII. The proposed method
TABLE VIII
CLASSIFICATION ACCURACIES OF DIFFERENT METHODS ON THE
UT-INTERACTION DATA SET (SET 1 AND SET 2)
Fig. 11. Confusion matrix on the UCF YouTube data set.
Fig. 12. Performance comparisons among some relative interest-point-based
approaches on the UCF YouTube Data Set.
outperforms most other methods and achieves a competitive
result.
G. Experiments on the UCF YouTube Data Set
We follow the original setup [3] using leave-one-out cross-validation over a pre-defined set of 25 folds, and perform dense
interest point extraction. The codebook size is set to 1000, and
the tradeoff parameters γ1 = γ2 = 0.3.
Fig. 11 presents the confusion matrix across all scenarios on
the UCF YouTube data set. In Fig. 12, we compare the per-class performance of related methods, including cuboid, LWWC, action unit selection, and some previously reported interest-point-based methods; our method outperforms the others.
TABLE IX
ACCURACIES ON THE UCF YOUTUBE DATA SET
Fig. 13. Performance of different versions of the proposed approach on the Hollywood2 data set.
TABLE X
COMPARISON WITH PREVIOUS METHODS ON THE HOLLYWOOD2 DATA SET
Table IX compares the overall mean accuracy of our method with the results reported by previous researchers. Our average
recognition accuracy is 82.2%, which is comparable to the
state-of-the-art performance and outperforms other interest-
point-based methods.
H. Experiments on the Hollywood2 Data Set
Similar to the parameters used in the YouTube data set, the
codebook size k is empirically set to 1000, and we set the
tradeoff parameters γ1 = γ2 = 0.3.
Fig. 13 presents the performance of our method and the contribution of each of its components to the recognition accuracy. The accuracy of traditional
single interest point feature is 47.9%. If we only utilize the
LWWC descriptor, the accuracy is 50.1%. When we only
adopt the action unit selection based on the traditional single
interest point feature, the accuracy is 54.5%. In combination,
the accuracy of our method is 56.8%. Table X compares the
overall mean accuracy of our method with the results reported
by previous researchers. Our average recognition accuracy is
better than or comparable to the state-of-the-art performances.
VII. CONCLUSION
In this paper, we have proposed to represent human actions
by a set of intermediate concepts called action units which
are automatically learned from the training data. At low level,
we have presented a locally weighted word context descriptor
to improve the traditional interest-point-based representation.
The proposed descriptor incorporates the neighborhood infor-
mation effectively. At high level, we have introduced the
GNMF-based action units to bridge the semantic gap in
action representation. Furthermore, we have proposed a new
joint l2,1-norm based sparse model for action unit selection
in a discriminative fashion. Extensive experiments have been
carried out to validate our claims and have confirmed our
intuition that the action unit based representation is critical
for modeling complex activities from videos.
REFERENCES
[1] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine
recognition of human activities: A survey,” IEEE Trans. Circuits Syst.
Video Technol., vol. 18, no. 11, pp. 1473–1488, Sep. 2008.
[2] J. C. Niebles, H. Wang, and L. Fei-Fei, “Unsupervised learning of human
action categories using spatial-temporal words,” Int. J. Comput. Vis., vol. 79, no. 3, pp. 299–318, Sep. 2008.
[3] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos
‘in the wild’,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
Jun. 2009, pp. 1996–2003.
[4] J. Niebles, C. Chen, and L. Fei-Fei, “Modeling temporal structure of
decomposable motion segments for activity classification,” in Proc. Eur.
Conf. Comput. Vis., 2010, pp. 392–405.
[5] H. Wang, C. Yuan, W. Hu, and C. Sun, “Supervised class-specific
dictionary learning for sparse modeling in action recognition,” Pattern
Recognit., vol. 45, no. 11, pp. 3902–3911, 2012.
[6] J. Liu, B. Kuipers, and S. Savarese, “Recognizing human actions by
attributes,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
Jun. 2011, pp. 3337–3344.
[7] J. Liu, S. Ali, and M. Shah, “Recognizing human actions using multiple
features,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
Jun. 2008, pp. 1–8.
[8] K. Rapantzikos, Y. Avrithis, and S. Kollias, “Dense saliency-based
spatiotemporal feature points for action recognition,” in Proc. IEEE Int.
Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1454–1461.
[9] W. Lee and H. Chen, “Histogram-based interest point detectors,” in
Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2009,
pp. 1590–1596.
[10] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action MACH a spatio-
temporal maximum average correlation height filter for action recogni-
tion,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2008,
pp. 1–8.
[11] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation
of local spatio-temporal features for action recognition,” in Proc. Brit.
Mach. Vis. Conf., 2009, pp. 1–11.
[12] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition
via sparse spatiotemporal features,” in Proc. 2nd Joint IEEE Int. Work-
shop Vis. Surveill. Perform. Eval. Track. Surveill., Oct. 2005, pp. 65–72.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning
realistic human actions from movies,” in Proc. IEEE Int. Conf. Comput.
Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[14] S. Ali and M. Shah, “Human action recognition in videos using
kinematic features and multiple instance learning,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 32, no. 2, pp. 288–303, Feb. 2010.
[15] A. F. Bobick and J. W. Davis, “The recognition of human movement
using temporal templates,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 23, no. 3, pp. 1257–1265, Mar. 2001.
[16] A. Yilmaz and M. Shah, “Actions sketch: A novel action representation,”
in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2005,
pp. 984–989.
[17] M. Blank, M. Irani, and R. Basri, “Actions as space-time shapes,” in
Proc. 10th IEEE ICCV, Oct. 2005, pp. 1395–1402.
[18] Z. Lin, Z. Jiang, and L. S. Davis, “Recognizing actions by shape-
motion prototype trees,” in Proc. IEEE 12th Int. Conf. Comput. Vis.,
Sep./Oct. 2009, pp. 444–451.
[19] F. Lv and R. Nevatia, “Single view human action recognition using key
pose matching and viterbi path searching,” in Proc. IEEE Int. Conf.
CVPR, Jun. 2007, pp. 1–8.
[20] H. Wang, C. Yuan, G. Luo, W. Hu, and C. Sun, “Action recognition
using linear dynamic systems,” Pattern Recognit., vol. 46, no. 6,
pp. 1710–1718, 2013.
[21] M. S. Ryoo and J. K. Aggarwal. (2010). An Overview of Contest on
Semantic Description of Human Activities (SDHA) [Online]. Available:
http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html
[22] M. Raptis and S. Soatto, “Tracklet descriptors for action modeling and
video analysis,” in Proc. ECCV, 2010, pp. 577–590.
[23] J. Sun, X. Wu, S. Yan, L. F. Cheong, T. S. Chua, and J. Li, “Hierarchical
spatio-temporal context modeling for action recognition,” in Proc. IEEE
Int. Conf. CVPR, Jun. 2009, pp. 2004–2011.
[24] M. Bregonzio, S. Gong, and T. Xiang, “Recognising action as clouds of
space-time interest points,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009,
pp. 1948–1955.
[25] S. F. Wong, T. K. Kim, and R. Cipolla, “Learning motion categories
using both semantic and structural information,” in Proc. IEEE Int. Conf.
CVPR, Jun. 2007, pp. 1–6.
[26] S. Savarese, A. Delpozo, J. Niebles, and L. Fei-Fei, “Spatial-temporal
correlations for unsupervised action classification,” in Proc. IEEE Work-
shop Motion Video Comput., Jan. 2008, pp. 1–8.
[27] I. Kotsia, S. Zafeiriou, and I. Pitas, “Texture and shape information
fusion for facial expression and facial action unit recognition,” Pattern
Recognit., vol. 41, no. 3, pp. 833–851, 2008.
[28] H. Jiang, M. Crew, and Z. Li, “Successive convex matching for action
detection,” in Proc. IEEE Int. Conf. CVPR, Jun. 2006, pp. 1646–1653.
[29] A. Klaser, M. Marszalek, and C. Schmid, “A spatio-temporal descriptor
based on 3D-gradients,” in Proc. Brit. Mach. Vis. Conf., 2008, pp. 1–10.
[30] P. Scovanner, S. Ali, and M. Shah, “A 3-dimensional SIFT descriptor
and its application to action recognition,” in Proc. ACM 15th Int. Conf.
Multimedia, 2007, pp. 357–360.
[31] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions:
A local SVM approach,” in Proc. IEEE 17th ICPR, vol. 3. Aug. 2004,
pp. 32–36.
[32] Z. Zhang, Y. Hu, S. Chan, and L. Chia, “Motion context: A new
representation for human action recognition,” in Proc. ECCV, 2008,
pp. 817–829.
[33] Y. Wang and G. Mori, “Human action recognition by semi-latent topic
models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 10,
pp. 1762–1774, Oct. 2009.
[34] J. Liu and M. Shah, “Learning human actions via information maxi-
mization,” in Proc. IEEE Int. Conf. CVPR, Jun. 2008, pp. 1–8.
[35] S. Ali, A. Basharat, and M. Shah, “Chaotic invariants for human action
recognition,” in Proc. IEEE 11th ICCV, Oct. 2007, pp. 1–8.
[36] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, “A biologically inspired
system for action recognition,” in Proc. IEEE 11th ICCV, Oct. 2007,
pp. 1–8.
[37] K. Schindler and L. Gool, “Action snippets: How many frames does
human action recognition require?” in Proc. IEEE Int. Conf. CVPR,
Jun. 2008, pp. 1–8.
[38] T. Berg, A. Berg, and J. Shih, “Automatic attribute discovery
and characterization from noisy web data,” in Proc. ECCV, 2010,
pp. 663–676.
[39] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects
by their attributes,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009,
pp. 1778–1785.
[40] S. E. Palmer, “Hierarchical structure in perceptual representation,”
Cognit. Psychol., vol. 9, no. 4, pp. 441–474, 1977.
[41] E. Wachsmuth, M. W. Oram, and D. I. Perrett, “Recognition of objects
and their component parts: Responses of single units in the temporal
cortex of the macaque,” Cereb. Cortex, vol. 4, no. 5, pp. 509–522, 1994.
[42] P. Paatero and U. Tapper, “Positive matrix factorization: A nonnegative
factor model with optimal utilization of error estimates of data values,”
Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[43] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative
matrix factorization,” Nature, vol. 401, pp. 788–791, Oct. 1999.
[44] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative
matrix factorization for data representation,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[45] Y. Wang and G. Mori, “Max-margin hidden conditional random fields for
human action recognition,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009,
pp. 872–879.
[46] D. Ramanan and D. Forsyth, “Automatic annotation of everyday move-
ments,” in Advances in Neural Information Processing Systems. Cam-
bridge, MA, USA: MIT Press, 2003.
[47] J. Liu, Y. Yang, and M. Shah, “Learning semantic visual vocabularies
using diffusion distance,” in Proc. IEEE Int. Conf. CVPR, Jun. 2009,
pp. 461–468.
[48] A. Fathi and G. Mori, “Action recognition by learning mid-level motion
features,” in Proc. IEEE Int. Conf. CVPR, Jun. 2008, pp. 1–8.
[49] H. Wang, A. Klaser, C. Schmid, and C. Liu, “Dense trajectories and
motion boundary descriptors for action recognition,” Int. J. Comput.
Vis., vol. 103, no. 1, pp. 60–79, 2013.
[50] C. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen
object classes by between-class attribute transfer,” in Proc. IEEE Int.
Conf. CVPR, Jun. 2009, pp. 951–958.
[51] J. Vogel and B. Schiele, “Semantic modeling of natural scenes for
content-based image retrieval,” Int. J. Comput. Vis., vol. 72, no. 2,
pp. 133–157, 2007.
[52] G. Wang, D. Hoiem, and D. Forsyth, “Learning image similarity from
Flickr groups using stochastic intersection kernel machines,” in Proc.
IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 428–435.
[53] L. J. Li, C. Wang, Y. Lim, D. Blei, and F. Li, “Building and using
a semantivisual image hierarchy,” in Proc. IEEE Int. Conf. CVPR,
Jun. 2010, pp. 3336–3343.
[54] L. Torresani, M. Szummer, and A. Fitzgibbon, “Efficient object category
recognition using classemes,” in Proc. Eur. Conf. Comput. Vis., 2010,
pp. 776–789.
[55] A. Gilbert, J. Illingworth, and R. Bowden, “Fast realistic multi-action
recognition using mined dense spatio-temporal features,” in Proc. IEEE
12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 925–931.
[56] A. Kovashka and K. Grauman, “Learning a hierarchy of discriminative
space-time neighborhood features for human action recognition,” in
Proc. IEEE Int. Conf. CVPR, Jun. 2010, pp. 2046–2053.
[57] C. Yuan, X. Li, W. Hu, H. Ling, and S. Maybank, “3D R transform
on spatio-temporal interest points for action recognition,” in Proc. IEEE
Conf. CVPR, Jun. 2013, pp. 724–730.
[58] C. Ding, D. Zhou, X. He, and H. Zha, “R1-PCA: Rotational invariant
L1-norm principal component analysis for robust subspace factoriza-
tion,” in Proc. 23rd ICML, 2006, pp. 281–288.
[59] A. Argyriou, T. Evgeniou, and M. Pontil, “Multi-task feature learning,”
in Advances in Neural Information Processing Systems. Cambridge, MA,
USA: MIT Press, 2007.
[60] G. Obozinski, B. Taskar, and M. Jordan, “Multi-task feature selection,”
Dept. Statist., Univ. California, Berkeley, CA, USA, Tech. Rep., 2006.
[61] H. Huang and C. Ding, “Robust tensor factorization using R1 norm,” in
Proc. IEEE Int. Conf. CVPR, Jun. 2008, pp. 1–8.
[62] F. R. K. Chung, Spectral Graph Theory. Providence, RI, USA: AMS,
1997.
[63] W. Brendel and S. Todorovic, “Activities as time series of human
postures,” in Proc. ECCV, 2010, pp. 721–734.
[64] B. Li, M. Ayazoglu, T. Mao, O. Camps, and M. Sznaier, “Activity
recognition using dynamic subspace angles,” in Proc. IEEE Int. Conf.
CVPR, Jun. 2011, pp. 3193–3200.
[65] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, “Learning hierarchical
invariant spatio-temporal features for action recognition with indepen-
dent subspace analysis,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern
Recognit., Jun. 2011, pp. 3361–3368.
[66] L. Yeffet and L. Wolf, “Local trinary patterns for human action recog-
nition,” in Proc. IEEE 12th ICCV, Sep./Oct. 2009, pp. 492–497.
[67] A. Yao, J. Gall, and L. Van Gool, “A Hough transform-based voting
framework for action recognition,” in Proc. IEEE Int. Conf. CVPR,
Jun. 2010, pp. 2061–2068.
[68] Q. Qiu, Z. Jiang, and R. Chellappa, “Sparse dictionary-based repre-
sentation and recognition of action attributes,” in Proc. IEEE ICCV,
Nov. 2011, pp. 707–714.
[69] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised
dictionary learning,” in Advances in Neural Information Processing
Systems. Cambridge, MA, USA: MIT Press, 2008.
[70] D. M. Bradley and J. A. Bagnell, “Differential sparse coding,” in
Advances in Neural Information Processing Systems. Cambridge, MA,
USA: MIT Press, 2008.
[71] J. Liu, Y. Yang, I. Saleemi, and M. Shah, “Learning semantic features for
action recognition via diffusion maps,” Comput. Vis. Image Understand.,
vol. 116, no. 3, pp. 361–377, 2012.
[72] N. Ikizler-Cinbis and S. Sclaroff, “Object, scene and actions: Combining
multiple features for human action recognition,” in Proc. ECCV, 2010,
pp. 494–507.
[73] M. Marszalek, I. Laptev, and C. Schmid, “Actions in context,” in Proc.
IEEE Int. Conf. CVPR, Jun. 2009, pp. 2929–2936.
[74] S. Sadanand and J. Corso, “Action bank: A high-level representation
of activity in video,” in Proc. IEEE Int. Conf. CVPR, Jun. 2012,
pp. 1234–1241.
[75] M. Raptis and L. Sigal, “Poselet key-framing: A model for human
activity recognition,” in Proc. IEEE Int. Conf. CVPR, Jun. 2013,
pp. 2650–2657.
[76] T. Lan, Y. Wang, and G. Mori, “Discriminative figure-centric models
for joint action localization and recognition,” in Proc. IEEE ICCV,
Nov. 2011, pp. 2003–2010.
[77] Y. Tian, R. Sukthankar, and M. Shah, “Spatiotemporal deformable part
models for action detection,” in Proc. IEEE Int. Conf. CVPR, Jun. 2013,
pp. 1–8.
[78] M. Ryoo, “Human activity prediction: Early recognition of ongoing
activities from streaming videos,” in Proc. IEEE ICCV, Nov. 2011,
pp. 1036–1043.
[79] Y. Zhang, X. Liu, M. Chang, X. Ge, and T. Chen, “Spatio-temporal
phrases for activity recognition,” in Proc. ECCV, 2012, pp. 707–721.
[80] A. Vahdat, B. Gao, M. Ranjbar, and G. Mori, “A discriminative key
pose sequence model for recognizing human interactions,” in Proc. IEEE
ICCV Workshops, Nov. 2011, pp. 1729–1736.
[81] A. Gaidon, Z. Harchaoui, and C. Schmid, “Temporal localization of
actions with actoms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35,
no. 11, pp. 2782–2795, Nov. 2013.
[82] M. Ullah, S. Parizi, and I. Laptev, “Improving bag-of-features action
recognition with non-local cues,” in Proc. Brit. Mach. Vis. Conf., 2010,
pp. 1–11.
[83] M. Raptis, I. Kokkinos, and S. Soatto, “Discovering discriminative action
parts from mid-level video representations,” in Proc. IEEE Int. Conf.
CVPR, Jun. 2012, pp. 1242–1249.
Haoran Wang received the B.S. degree from the
Department of Information Science and Technology,
Northeast University, Shenyang, China, in 2008.
He is a Ph.D. student in the School of Automation, Southeast University, Nanjing, China. His research interests include action recognition, motion analysis, and event detection.
Chunfeng Yuan received the B.S. and M.S. degrees
in computer science from the Qingdao University
of Science and Technology, China, in 2004 and
2007, respectively, and the Ph.D. degree in computer
science from the Institute of Automation (CASIA),
Chinese Academy of Sciences, Beijing, China, in
2010. She was an Assistant Professor at CASIA.
Her research interests and publications range from
statistics to computer vision, including sparse rep-
resentation, motion analysis, action recognition, and
event detection.
Weiming Hu received the Ph.D. degree from the
Department of Computer Science and Engineering,
Zhejiang University, in 1998. From 1998 to 2000, he
was a Post-Doctoral Research Fellow with the Insti-
tute of Computer Science and Technology, Peking
University. Currently, he is a Professor in the Insti-
tute of Automation, Chinese Academy of Sciences.
His research interests include visual surveillance and
filtering of Internet objectionable information.
Haibin Ling received the B.S. degree in mathe-
matics and the M.S. degree in computer science
from Peking University, China, in 1997 and 2000,
respectively, and the Ph.D. degree from the Univer-
sity of Maryland, College Park, in computer science
in 2006. From 2000 to 2001, he was an Assistant
Researcher at Microsoft Research Asia. From 2006
to 2007, he worked as a Post-Doctoral Scientist at
the University of California Los Angeles. He joined
Siemens Corporate Research as a Research Scientist.
Since 2008, he has been an Assistant Professor at
Temple University. His research interests include computer vision, medical
image analysis, human computer interaction, and machine learning. He
received the Best Student Paper Award at the ACM Symposium on User
Interface Software and Technology in 2003. He is currently an Area Chair
for CVPR 2014.
Wankou Yang received the B.S., M.S., and Ph.D.
degrees from the School of Computer Science and
Technology, Nanjing University of Science and
Technology, China, in 2002, 2004, and 2009, respec-
tively. He is currently an Assistant Professor with
School of Automation, Southeast University. His
research interests include pattern recognition, com-
puter vision, and digital machine learning.
Changyin Sun is a Professor in the School
of Automation, Southeast University, China. He
received the M.S. and Ph.D. degrees in electri-
cal engineering from Southeast University, Nan-
jing, China, in 2001 and 2003, respectively. His
research interests include intelligent control, neural
networks, SVM, pattern recognition, and optimal
theory. He received the First Prize of Nature Science
of the Ministry of Education, China. He is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS, Neural Processing Letters, the International Journal of Swarm Intelligence Research, and Recent Patents on Computer Science.