Monocular Human Pose Estimation with Bayesian Networks

This work is licensed under the Creative Commons "Attribution" 3.0 Taiwan license

Yuan-Kai Wang

Electronic Engineering Department,
Fu Jen University
2010/6/11


Outline
1. Introduction
2. Markerless Monocular Human Pose Estimation
3. Overview of the Approach
4. Model Learning by EM algorithm
5. Pose Estimation by Approximate Inference
6. Feature Extraction
7. Experimental Results
8. Conclusions


1. Introduction
• Applications of Human Motion
Capture
– Performance animation in movie making
– Game
– Medical diagnosis
– Sport & Health
– Visual surveillance


Performance Animation
• Avatar
• The Lord of the Rings


Game
• Microsoft's Project Natal for XBOX360


Medical Diagnosis
• Gait analysis for
Rehabilitation


Sport & Health
• Golf training


Visual Surveillance
• Behavior analysis for event detection
– Irregular movement, body language, and
unusual interactions, fighting
– Car crash
• Content-based retrieval


Sensor Approaches
• Active sensors
– Types
• Electro-magnetic marker
• Optical
• Accelerometer
– Wired connection
– Drawbacks
• Intrusive
• Expensive
• Time consuming
[Figure annotation: "Too Many Wires"]
• Passive sensors by camera
– Marker-based
– Markerless


Marker-based Sensors
• Add visual markers on body
– Active marker
• Visual/non-visual light
– Passive marker
• Need computer vision algorithms
• Advantages
– No wires
• Drawbacks
– Semi-intrusive
– Time consuming
[Figure: examples of an active marker and a passive marker]


Markerless Sensors
• No attachment on human body
• Heavily dependent on the computer vision analyzer (a pure vision solution)
– Stereo/Multiple cameras
– Monocular cameras


Sensor vs. Analyzer

T. B. Moeslund, "Computer vision-based human motion capture – a
survey", Technical report LIA 99-02, University of Aalborg, 1999.


Pose Estimation
vs. Gesture Recognition

[Figure: a pose estimation result next to a gesture recognition result labeled "Walking"]


2D vs. 3D


2. Markerless Monocular
Human Motion Capture
• Goal
– Markerless
– Single camera
– 3D poses
• Challenges
– Ill-posed
– Highly articulated
– Self-occluding
[Figure: depth ambiguities and occlusion using monocular silhouettes]


Joint Representation
• The articulated human body is linked by joints


Abstract Representation
[Figure: 2D and 3D body representations, shown as stick figures and as surface/volume models]


Literature Review
Pipeline from low-level observation to high-level abstraction:
• Image space (pixel domain)
• Human segmentation (S)
– Full body
– Body parts
• Image feature descriptor (F)
– Shape
– Silhouette
– Color
– Appearance
– Motion
– Feature location (corner)
– ...
• 2D joint location (J)
• 3D model parametric space (pose domain, P)
– Joint angle Θi
– Joint point Pi
Mappings used in the literature:
• Background subtraction: P=f(S)
• Object detection: P=f(F)
• Marker-based: P=f(J)
[Figure: 3D stick figure in X/Y/Z axes with joints neck, bottom, and left/right shoulder, elbow, hand, waist, knee and foot]

A two-stage approach is proposed: P=f1(f2(F))


Approaches
• Model-free [Agarwal, 2006] [Loy, 2004]
– Joint articulation is not used to constrain the search for the mapping function P = f(X)
• Model-based [Rbert, 2006] [Rohr, 1994]
– A model of human articulation constrains the search for f and P
– Two kinds of approach
• Discriminative
• Generative: Bayesian networks (BNs)
Training: $\hat{f} = \arg\max_{f} L_1(\text{Training}, f)$
Inference: $\hat{P} = \arg\max_{P} L_2(\hat{f} \mid X, P)$


An Articulated Model
= A Bayesian Network
• The human body is represented as a kinematic tree, consisting of segments linked by joints
• Kinematic models are expressed as graphical probability networks
• Graphical probability models are computed via Bayesian networks
[Figure: 3D stick figure in X/Y/Z axes with joints neck, bottom, and left/right shoulder, elbow, hand, waist, knee and foot]


Three Steps to Utilize BNs
• Representation, learning and inference
– Representation: joint node X1 and feature nodes X2, X3, X4; the feature-joint correspondence is given by conditional probability
– Learning: $\hat{f} = \arg\max_{f} L_1(\text{Training}, f)$
– Inference (pose estimation): P(X1|X2,X3,X4), $\hat{P} = \arg\max_{P} L_2(\hat{f} \mid X, P)$
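The inference step can be illustrated with a minimal sketch. It assumes a toy discrete network in which the joint node X1 is the parent of three feature nodes X2, X3, X4. The state names and every probability value below are made up for illustration and are not taken from the slides.

```python
def posterior_joint(prior, feature_tables, observed):
    """Return P(X1 | X2, X3, X4) for a joint node X1 with feature children.

    prior:          {x1: P(X1 = x1)}
    feature_tables: one table per feature, {x1: {value: P(value | x1)}}
    observed:       the observed value of each feature, in table order
    """
    scores = {}
    for x1, p in prior.items():
        # Multiply the prior by the likelihood of every observed feature.
        for table, value in zip(feature_tables, observed):
            p *= table[x1][value]
        scores[x1] = p
    z = sum(scores.values())  # normalizing constant P(X2, X3, X4)
    return {x1: s / z for x1, s in scores.items()}


# Hypothetical two-state joint ("raised" / "lowered") and three features.
prior = {"raised": 0.5, "lowered": 0.5}
feature_tables = [
    {"raised": {"high": 0.8, "low": 0.2}, "lowered": {"high": 0.2, "low": 0.8}},
    {"raised": {"wide": 0.6, "narrow": 0.4}, "lowered": {"wide": 0.5, "narrow": 0.5}},
    {"raised": {"skin": 0.9, "none": 0.1}, "lowered": {"skin": 0.5, "none": 0.5}},
]
posterior = posterior_joint(prior, feature_tables, ["high", "wide", "skin"])
pose_hat = max(posterior, key=posterior.get)  # arg max over joint states
```

Here `pose_hat` plays the role of the estimated pose: the joint state with the largest posterior probability given the observed features.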


Two Causal Models in BNs
• Undirected acyclic graph [Lan, 2008] [Hua, 2005]
– The network is a tree or graph model in which the edge linking two nodes has no direction.
– An edge between X1 and X2 carries the joint probability P(X1,X2)

• Directed acyclic graph [Ramanan, 2007] [Lee, 2006] [Leonid, 2003]
– Every node has directed arcs linking it to another node.
– An arc between X1 and X2 carries the conditional probability P(X1|X2)


Directed Bayesian Articulated
Model
• Nodes in a directed acyclic graph (DAG) are not influenced by their child nodes.
• The relations between human body parts are not regarded as two-way
[Figure: directed graph over the 2D joint nodes h2d,1 to h2d,15]


Inference of Bayesian Networks
• Top-down approach [Gavrila, 1996]
– Strong at finding human body parts in the image.
• Bottom-up approach [Ren, 2005]
– Strong at finding people in the image.
• Combined approach [Navaraman, 2005][Lee, 2002]
– Benefits from the advantages of both.


3. Overview of the Approach
[Figure: the 2D stick model and the 3D stick model (X/Y/Z axes), each with joints head, neck, bottom, and left/right shoulder, elbow, hand, waist, knee and foot]

Both models are belief propagation networks that use an annealed Gibbs sampling algorithm.


System Architecture
• We estimate the 2D human joint positions before 3D estimation.
[Flowchart blocks]
– 2D model training: 2D Bayesian human model setting, training features, EM training
– 3D model training: 3D Bayesian human model setting, training features, EM training
– Testing: testing image, feature extraction, 2D Bayesian inference with annealed Gibbs sampling, 3D Bayesian inference with annealed Gibbs sampling, result


2D Human Graphical Model
• The articulated structure of the 2D human body is represented by a 15-node graphical model.
H2D = {h2d,1, ..., h2d,15}
[Figure: 2D stick figure (articulated model) with joints head, neck, bottom, and left/right shoulder, elbow, hand, waist, knee and foot, labeled h2d,1 to h2d,15]


3D Human Graphical Model
• The 3D human body model is described by a 45D vector H3D that holds the three coordinates of each of the 15 joint nodes in 3D space
H3D = {h3d,1, ..., h3d,15}
[Figure: 3D stick figure (articulated model) in X/Y/Z axes with joints neck, bottom, and left/right shoulder, elbow, hand, waist, knee and foot, labeled h3d,1 to h3d,15]


The BN Model
• A directed acyclic graph G = (V, E, C)
– V: vertex set {Vi, 1≤i≤N}
– E: a set of directed edges (i,j)
– C: (i,j) → R+, edge cost functions
• To encode probabilistic information
– An edge indicates a probabilistic dependence
– C: P(Vi | Vj): conditional probability function set
• The 2D and 3D BNs
G2D = (V2D, E2D, C2D), G3D = (V3D, E3D, C3D)
[Figure: directed graph over the 2D joint nodes h2d,1 to h2d,15]


2D Graphical Model

V2D = {H2D, O2D}
O2D: Nc, S, A, C
C2D = {P(h2d,i | pa(h2d,i))}
[Figure: 2D Bayesian network over the joint nodes h2d,i and the observation nodes]


3D Graphical Model
V3D = {H3D, O3D}
O3D: h2d, wN, L
C3D = {P(h3d,i | pa(h3d,i))}
[Figure: the 3D network drawn in two parts, an upper body (nodes hu3d,1 to hu3d,7) and a lower body (nodes hl3d,1 to hl3d,7), each connected to the 2D joint nodes h2d,i]


Joint Probability Distribution
(JPD)
• The two proposed graphical models specify two unique JPDs: P2D(V2D) and P3D(V3D)
• Let P(V) represent either of the two JPDs
$P(V) = \prod_{i=1}^{n} P(V_i \mid pa(V_i))$
• The factorization of the JPD comes from the Markov blanket, a local Markov property
• If we can learn the finite set of conditional probabilities, we can infer the human pose
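The factorization can be checked numerically on a small example. The sketch below assumes a hypothetical three-node chain (neck → shoulder → elbow) with binary states and made-up conditional probability tables; the real models have 15 joint nodes plus observation nodes.

```python
from itertools import product

# Hypothetical structure: each node lists its parents pa(V_i).
parents = {"neck": (), "shoulder": ("neck",), "elbow": ("shoulder",)}

# Hypothetical CPTs: cpt[node][parent_values][value] = P(value | parent_values).
cpt = {
    "neck": {(): {0: 0.7, 1: 0.3}},
    "shoulder": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
    "elbow": {(0,): {0: 0.6, 1: 0.4}, (1,): {0: 0.3, 1: 0.7}},
}


def joint_probability(assignment):
    """P(V) as the product of P(V_i | pa(V_i)) over all nodes."""
    p = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_values][assignment[node]]
    return p


# Summing the factorized JPD over all assignments must give 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
p_all_one = joint_probability({"neck": 1, "shoulder": 1, "elbow": 1})
```

Only the local tables are stored: three small CPTs instead of one table over all 2^3 joint states, which is what makes learning the "finite conditional probabilities" tractable.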


Two Problems
• Training problem
– Given a training set: {O2d, O3d}
– How can we learn the edge cost function C = { P(h | pa(h)) }?
– We apply the EM algorithm
• Inference problem
– Given an evidence O
– How can we infer the human pose P(H | O) from P(V)?
– We propose an annealed Gibbs sampling algorithm


4. Model Learning by EM
• Why apply the EM algorithm for model
learning
– The human poses and observations are incomplete and sparse
• Incomplete: occlusion due to the single camera
• Sparse: few training samples in a high-dimensional space


The Likelihood Function
• The training set D={D1,…,DN}
– N represents the number of training samples
– Dl={V1[l],…,Vn[l]} is the l-th training sample
• Let θ be the learning model: C = { P(h | pa(h)) }
• Treating the prior P(θ) as uniform, the estimate is
$\hat{\theta} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \frac{P(D \mid \theta) P(\theta)}{P(D)} = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \prod_{l=1}^{N} P(D_l \mid \theta)$
• A log-likelihood function $L_D(\theta) = \log P(D \mid \theta)$ is formulated based on the independence assumption of training samples
$L_D(\theta) = \log \prod_{l=1}^{N} P(V_1[l], \ldots, V_n[l] \mid \theta) = \sum_{i=1}^{n} \sum_{l=1}^{N} \log P(V_i[l] \mid pa(V_i[l]), \theta)$


MLE vs. EM
• If D is complete, we can apply MLE (Maximum Likelihood Estimation) to find θ
• However, D is incomplete because of occlusion and partial observability
• Let D=Y∪U
– Y is the observed data
– U is the missing data
[Figure: 2D Bayesian network over the joint nodes h2d,i]


The EM
• Expectation Step
– Computes the expectation of the log likelihood function
$Q(\theta \mid \theta^{(t)}) = E_{\theta^{(t)}}[\log P(D \mid \theta) \mid \theta^{(t)}, Y]$
• Maximization Step
– Updates the step t+1 parameter θ(t+1) from the current parameter θ(t)
$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$
• Stop condition of the E-M iteration
– $L_D(\theta^{(t+1)}) - L_D(\theta^{(t)})$ converges
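The E and M steps can be sketched on a deliberately tiny case. The example assumes a hypothetical binary network A → B in which the parent A is sometimes unobserved (for instance, occluded). The data, the initial parameters and the tolerance are illustrative assumptions, not values from the slides.

```python
import math


def em_hidden_parent(samples, p_a=0.6, p_b=(0.3, 0.7), tol=1e-8, max_iter=200):
    """EM for A -> B with binary nodes; a sample is (a, b), a may be None.

    p_a is P(A=1) and p_b[k] is P(B=1 | A=k). Returns the fitted parameters
    and the observed-data log-likelihood recorded at every iteration.
    """
    p_b = list(p_b)
    history = []
    for _ in range(max_iter):
        n_a = [0.0, 0.0]   # expected counts of A = k
        n_ab = [0.0, 0.0]  # expected counts of (A = k, B = 1)
        loglik = 0.0
        # E-step: expected sufficient statistics under the current parameters.
        for a, b in samples:
            joint = []
            for k in (0, 1):
                prior = p_a if k == 1 else 1.0 - p_a
                like = p_b[k] if b == 1 else 1.0 - p_b[k]
                joint.append(prior * like)
            if a is None:  # missing parent: use P(A = k | b)
                total = joint[0] + joint[1]
                resp = [joint[0] / total, joint[1] / total]
                loglik += math.log(total)
            else:          # observed parent: hard assignment
                resp = [1.0 if k == a else 0.0 for k in (0, 1)]
                loglik += math.log(joint[a])
            for k in (0, 1):
                n_a[k] += resp[k]
                n_ab[k] += resp[k] * b
        # Stop when the log-likelihood improvement falls below the tolerance.
        converged = bool(history) and loglik - history[-1] < tol
        history.append(loglik)
        if converged:
            break
        # M-step: re-estimate the CPT entries from the expected counts.
        p_a = n_a[1] / len(samples)
        p_b = [n_ab[k] / n_a[k] for k in (0, 1)]
    return p_a, p_b, history


samples = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1),
           (None, 1), (None, 1), (None, 0), (None, 0)]
p_a_hat, p_b_hat, history = em_hidden_parent(samples)
```

The recorded log-likelihood never decreases from one iteration to the next, which is the property the stop condition relies on.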


5. Pose Estimation by
Approximate Inference
• Let the observed data be O'=O-U
– U is the set of hidden variables that are unobservable due to occlusion
• The best estimated pose is a vector H*, defined as the pose with the maximum probability given O'.
$H^* = \arg\max_{H} P(H \mid O') = \arg\max_{H} \int_{u \in U} P(H, u \mid O')\,du$
$= \arg\max_{H} \int_{u \in U} P(H, O', u)\,du = \arg\max_{H} \int_{u \in U} \prod_{i=1}^{n} P(V_i \mid pa(V_i))\,du$
• The integrand is the JPD P(V), with V = H ∪ O' ∪ U


Inference of Posterior Probability
• How to calculate the posterior
probability?
$H^* = \arg\max_{H} \int_{u \in U} \prod_{i=1}^{n} P(V_i \mid pa(V_i))\,du$
– Exact inference
• Junction tree, Message passing
– Approximate inference
• Loopy belief propagation, Variational method
• Markov chain Monte Carlo (MCMC) sampling
– Metropolis-Hastings
– Gibbs sampling


Approximate Inference (1/2)
• The MCMC algorithm uses random sampling
• It approximates the posterior distribution P(V) by random number generation
• The key idea of MCMC is to simulate the sampling process as a Markov chain
• Definition
• A sample vector v of V
• A proposal distribution q(v*|v(t-1)) to generate v*
• An acceptance probability α to accept v* as v(t)
$\alpha(v^{(t-1)}, v^*) = \min\left\{1, \frac{p(v^*)\, q(v^{(t-1)} \mid v^*)}{p(v^{(t-1)})\, q(v^* \mid v^{(t-1)})}\right\}$
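The acceptance rule can be tried out with a short Metropolis-Hastings sketch. The three-state target and the uniform proposal below are assumptions chosen only to keep the example small; the function and variable names are hypothetical.

```python
import random


def metropolis_hastings(p, q_sample, q_prob, v0, n_samples, seed=0):
    """Generate a Markov chain whose stationary distribution is proportional to p.

    p(v):             unnormalized target probability of state v
    q_sample(v, rng): draws a candidate v* from q(. | v)
    q_prob(a, b):     proposal probability q(a | b)
    """
    rng = random.Random(seed)
    v = v0
    chain = []
    for _ in range(n_samples):
        v_star = q_sample(v, rng)
        numerator = p(v_star) * q_prob(v, v_star)   # p(v*) q(v(t-1) | v*)
        denominator = p(v) * q_prob(v_star, v)      # p(v(t-1)) q(v* | v(t-1))
        alpha = min(1.0, numerator / denominator)
        if rng.random() < alpha:
            v = v_star                              # accept v* as v(t)
        chain.append(v)                             # otherwise keep v(t-1)
    return chain


# Hypothetical target over states 0, 1, 2 with probabilities 0.1, 0.2, 0.7.
weights = [1.0, 2.0, 7.0]
chain = metropolis_hastings(
    p=lambda v: weights[v],
    q_sample=lambda v, rng: rng.randrange(3),  # uniform proposal
    q_prob=lambda a, b: 1.0 / 3.0,
    v0=0,
    n_samples=20000,
)
freq_state_2 = chain.count(2) / len(chain)
```

After enough steps the empirical frequency of each state approaches its target probability, so `freq_state_2` should be close to 0.7.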


Approximate Inference (2/2)
• MCMC generates a Markov chain (v(0), v(1), ..., v(k), ...), in which the transition probability from v(t-1) to v(t)
– Depends only on v(t-1)
– But not on (v(0), v(1), ..., v(t-2))
• The chain approaches its stationary distribution
– Samples from the vector (v(k+1), ..., v(k+n)) are samples from P(V)
• However, if V is high-dimensional, MCMC converges slowly


Annealed Gibbs Sampling (1/4)
• Gibbs sampling method
– Formally proposed by Geman & Geman in 1984 for Markov Random Fields (MRF)
– Here the sampler is revised for the proposed two-stage Bayesian network
– The basic idea
• Sampling uni-variate conditional distributions
• That is, the Markov chain (v(0), v(1), ..., v(k), ...) is built by changing only one variable of v at a time


• We draw from the distribution
$v_j^{(t)} \sim P(V_j \mid v_1^{(t)}, \ldots, v_{j-1}^{(t)}, v_{j+1}^{(t)}, \ldots, v_n^{(t)})$
• The Annealed Gibbs (AG) sampler
– The sampling of the uni-variate conditional distributions is controlled by a stochastic process of simulated cooling
$q(v^* \mid v^{(t)}) = \begin{cases} p(v_j^* \mid v_{-j}^{(t)}) & \text{if } v_{-j}^* = v_{-j}^{(t)} \\ 0 & \text{otherwise} \end{cases}$
$\alpha_{AG} = \min\left\{1, \left(\frac{p(v^*)}{p(v^{(t)})}\right)^{1/T(t)} \frac{q(v^{(t)} \mid v^*)}{q(v^* \mid v^{(t)})}\right\}$


• Function T(t) is called the cooling schedule
$T(t) = T_0 \left(\frac{T_f}{T_0}\right)^{t/n}$
• The particular value of T at any point in the chain is called the temperature
– T0 is the start temperature
– Tf is the final temperature, reached after n steps
• As the process proceeds, we decrease the probability of down-hill moves
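The cooling schedule is a geometric interpolation between the start and final temperatures. A minimal sketch, with T0 = 10, Tf = 0.1 and n = 100 chosen purely as example values:

```python
def cooling_schedule(t, n, t0, tf):
    """Temperature at step t: T(t) = T0 * (Tf / T0) ** (t / n)."""
    return t0 * (tf / t0) ** (t / n)


# Example values (assumptions): cool from 10.0 down to 0.1 over 100 steps.
temperatures = [cooling_schedule(t, 100, 10.0, 0.1) for t in range(101)]
```

The temperature falls by the same factor at every step, so it starts at T0, reaches the geometric mean of T0 and Tf halfway through, and ends at Tf.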


• The AG sampler adopts a stochastic iterative algorithm that converges to the set of points which are the global maxima of the given function
• The advantage of the AG sampler is
– It is more efficient than the Gibbs sampler
• Because, instead of approximating P(V),
– We want to find the global maximum, i.e., the ML estimate of the posterior distribution.
– We run a Markov chain of invariant distribution P(V) and estimate only the global mode

6. Feature Extraction
• Human silhouette sampling
• Normalized width
• Normalized center
• Spatial distribution of skin color
• Corners of silhouette
[Figure: silhouette annotated with Width and Length]


Human Silhouette Sampling (S)
• Human segmentation
• Human silhouette capturing [Suzuki, 1985]
• Uniform sampling is used in human
silhouette sampling.


Normalized Width (wN )
• Human segmentation
• Binary image profile
• Width adjustment
$w_N = x_R - x_L$
$x_L = x \text{ such that } h_x \ge \text{threshold and } h_{x-1} < \text{threshold}, \text{ for } x = 1 \to w$
$x_R = x \text{ such that } h_x \ge \text{threshold and } h_{x+1} < \text{threshold}, \text{ for } x = w \to 1$
[Figure: profile of the X coordinate (horizontal axis: x coordinate of image; vertical axis: pixel accumulation value), with the Width and Length annotations of the normalization]
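The width adjustment can be sketched as follows. The example assumes the profile h_x is the number of foreground pixels in image column x, uses 0-based column indices, and treats columns outside the image as empty; the mask and the threshold are made-up values.

```python
def normalized_width(mask, threshold):
    """Return (x_left, x_right, w_n) from a binary silhouette mask.

    mask is a list of rows of 0/1 pixels. Raises StopIteration when no
    column of the profile reaches the threshold.
    """
    width = len(mask[0])
    # Binary image profile: foreground pixel count of every column.
    profile = [sum(row[x] for row in mask) for x in range(width)]

    def h(x):
        return profile[x] if 0 <= x < width else 0

    # x_L: scanning left to right, first column that crosses the threshold.
    x_left = next(x for x in range(width)
                  if h(x) >= threshold and h(x - 1) < threshold)
    # x_R: scanning right to left, first column that crosses the threshold.
    x_right = next(x for x in range(width - 1, -1, -1)
                   if h(x) >= threshold and h(x + 1) < threshold)
    return x_left, x_right, x_right - x_left


# Hypothetical 4 x 8 silhouette mask; column sums are [0, 0, 3, 4, 4, 2, 0, 0].
mask = [
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
]
x_left, x_right, w_n = normalized_width(mask, threshold=2)
```

The threshold suppresses thin protrusions and noise: a column counts as part of the body only when enough of its pixels are foreground.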


Normalized Center (Nc)
• Boundary adjustment
• Center of new boundary
$x_N = x_p + 0.5\,w_N$
$y_N = y_p + 0.5\,L$
[Figure: Width and Length of the adjusted boundary]


Spatial Distribution of
Skin Color (A)
[Flowchart blocks]
• Skin color detection by GMM
• Morphology
• Region segment
• Spatial distribution of skin color


Corners of Silhouette (C)
• Human segmentation
• Human silhouette capturing
• The level curve curvature approach [Lindeberg, 1998]
$\tilde{I}(x, y) = \arg\max \left( D_x^2 D_{yy} + D_y^2 D_{xx} - 2 D_x D_y D_{xy} \right)$

• Adaptive corner choice


7. Experimental Results
• Experimental environment
– CPU: 1.86 GHz, RAM: 1 GB, Visual C++ 6.0
– HumanEva database I


HumanEva Database I
• Provider:
– Department of Computer Science, Brown University
• Actions of HumanEva I
Action        Description
Walking       Subjects walked in an elliptical path around the capture space.
Jog           Subjects jogged in an elliptical path around the capture space.
Gesture       Subjects performed "hello" and "good-bye" gestures in repetition.
Throw/Catch   Subjects tossed and caught a baseball with the help of the lab assistant.
Box           Subjects imitated boxing.
Combo         Subjects performed combined actions of walking and jogging.


Environment Setting
• 7 cameras
– 3 color cameras (C1, C2, C3)
– 4 gray level cameras (BW1, BW2, BW3, BW4)
[Figure: camera layout around the capture space, marked 3 m and 2 m, with the control station]


The Experimental Data
• The proposed method was trained on 1900 images from the walking sequences of subjects 1 and 2, captured by camera C1
• 200 testing images:
• 100 images from subject 1
• 100 images from subject 2
• Difficulties:
– Self-occlusion
– Clothing variation
– Large variation of joint location

Evaluation of Accuracy
• Average distance error of poses between the estimated results and the ground truth
• Let H = {h1, h2, ..., hM}, where hm ∈ R3 (or hm ∈ R2 for the 2D body model), be the position vectors of the body pose in the world (or in the image, respectively)
• D(H, H*): the error of the estimated pose H* with respect to the ground truth pose H
$D(H, H^*) = \sum_{m=1}^{M} \frac{\lVert h_m - h_m^* \rVert}{M}$
$\xi = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} D(H_{t,n}, H_{t,n}^*)$
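The two error measures can be computed directly from joint coordinates. A minimal sketch, assuming each pose is a list of joint position tuples and that all evaluated frames are supplied as one flat list; the coordinates are made-up values.

```python
import math


def pose_error(h_true, h_est):
    """D(H, H*): mean Euclidean distance between corresponding joints."""
    distances = [math.dist(a, b) for a, b in zip(h_true, h_est)]
    return sum(distances) / len(h_true)


def mean_pose_error(truth_frames, estimated_frames):
    """xi: D(H, H*) averaged over all evaluated frames."""
    errors = [pose_error(h, h_star)
              for h, h_star in zip(truth_frames, estimated_frames)]
    return sum(errors) / len(errors)


# Hypothetical two-joint poses over two frames.
truth = [[(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 0)]]
estimate = [[(0, 0, 3), (1, 4, 0)], [(0, 0, 0), (1, 0, 0)]]
d_first = pose_error(truth[0], estimate[0])
xi = mean_pose_error(truth, estimate)
```

The same functions work for the 2D body model, since `math.dist` accepts points of any matching dimension.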


Performance Comparison Between
Two-stage and One-stage methods

• The AG sampler performs better than the Gibbs sampler
• The two-stage approach performs better than the classical one-stage approach
• The AG sampler takes less inference time


Effect of Iteration Number
on Accuracy


2D Results of Subject 1
[Figure: ground truth (GT) and annealed Gibbs sampler (AGs) results for frames 1122, 1149, 1172 and 1200]


2D Results of Subject 2
[Figure: ground truth (GT) and annealed Gibbs sampler (AGs) results for frames 804, 835, 875 and 899]


3D Results
• Frame 1110 of subject 1
[Figure: 3D plots of the ground truth and of the AGs estimation result]


3D Results (Cont.)
[Figures: three further slides of paired 3D plots]


8. Conclusions
• A markerless and monocular motion
capture problem is considered
• The proposed two-stage annealed Gibbs sampling method estimates poses more accurately and with less computation time than the one-stage approach and the plain Gibbs sampler
• The method can overcome three challenges of the problem
– Self-occlusion
– High-degree variation of joint locations
– Clothing variation


Future Work
• Use GMM to approximate the prior and posterior distributions of our human models
• Combine model-free and model-based methods to obtain the benefits of both
• Exploit HMM to infer human motions in time series
• Add human part detectors to help locate human joints


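License statement for this presentation
• The content of this presentation is licensed under the Creative Commons "Attribution-NonCommercial 3.0 Taiwan" license
• Reproduction, distribution or modification of the content for non-commercial purposes is welcome, provided you indicate: (1) the original author's name: 王元凱 (Yuan-Kai Wang); (2) the icon mark:
• Some of the graphics used in this presentation were taken from the Internet. They are used by the speaker only under a claim of fair use for talks promoting free software. Readers may not reuse them unless they judge that they also qualify for fair use, and they bear the related legal responsibility themselves.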
