Towards Accurate Multi-person Pose Estimation in the Wild (My summary)

HACETTEPE UNIVERSITY
Presenter: A. Haje Karim
Date: 31.12.2018
George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev,
Jonathan Tompson, Chris Bregler, Kevin Murphy
Google, Inc.

Objectives
• Motivation
• Problem Definition
• Approach
• Evaluation
• Strengths and Weaknesses
• Discussion
• References

Motivation (Why computer vision?)
Figure 1 Human brain lobes (left) and evolution of eye (right)
https://qbi.uq.edu.au/brain/brain-anatomy/lobes-brain
https://www.pinterest.com/pin/565905509398223146/

Motivation (Why computer vision?)
Figure 2. Mars rover Curiosity
“When the next-generation Mars rover, dubbed Curiosity, touches down on martian soil
next summer, its cameras will likely capture a scene similar to what the first explorers
of the Grand Canyon witnessed: towering layers of rock and sediment rising up from a
dusty valley…” [1]
http://news.mit.edu/2011/mars-rover-0725

Motivation (Why pose estimation?)
Figure 3 Computer vision and its direct relation to humans[2].
SurvHuman : https://static.independent.co.uk/s3fs-public/thumbnails/image/2013/01/01/20/pg-20-cctv-getty.jpg?w968h681
autCar: http://fortune.com/2016/08/30/self-driving-drive-ai/
factRobHum: https://www.forbes.com/sites/alexknapp/2015/05/06/how-businesses-are-teaching-robots-new-tricks/#47cca6232859

Motivation (Why this paper?)
Why is this paper novel and critical?
• No prior knowledge of the location and scale
• Multi-person and crowded scene

Problem Definition (What is the problem?)
We would like to:
1. Detect many people.
2. Estimate the pose of each detected person.
Figure 4 An image taken from COCO 2018 Keypoint Detection Task

Problem Definition (Clearly)
Estimate the 2-D locations of human joints on the arms, legs, and keypoints on the torso and the face.
The method should work when there are many people in the scene.
The method should also work in a cluttered environment.

Previous Work
• A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
• They mainly used a deep CNN.
• As the figure shows, it does not handle multiple people.
Figure 5 Pose estimation results on images from LSP [4]

Previous Work
• A. Jain, J. Tompson, M. Andriluka, G. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. In ICLR, 2014.
• For every feature, compute a score based on whether it is present in the region or not.
Figure 6 CNN structure used in this work [5]

Previous Work
1. The network performs end-to-end feature learning and is trained with the back-propagation algorithm.
2. It starts with a 64×64 pixel RGB input patch which has been local contrast normalized (LCN) to emphasize geometric discontinuities and improve generalization performance.
3. The input is then processed by three convolution and subsampling layers, which use rectified linear units (ReLUs) and max-pooling.
4. Following the three stages of convolution and subsampling, the top-level pooled map is flattened to a vector and processed by three fully connected layers, analogous to those used in deep neural networks.
5. The output layer has a single logistic unit, representing the probability of the body part being present in that patch.
6. To train the convnet, we performed standard batch stochastic gradient descent.
7. From the training set images, we set aside a validation set to tune the network hyper-parameters, such as number and size of features, learning rate, etc.

Previous Work
M. A. Fischler and R. Elschlager. The representation and matching of pictorial structures. In IEEE TOC, 1973.
Quite an old paper, but it cannot be skipped!
Research in human pose estimation has been based on the idea of part-based models, as pioneered by the Pictorial Structures (PS) model.
The majority of these methods focus on capturing rich dependencies among body parts and properties.

Previous Work
Figure 7 Reference description of a face and schematic representation of face reference, indicating components and their linkages [6]
The representation and matching of pictorial structures.

Approach (Some Theoretical Background)
• Definition: Human Pose Estimation is defined as 2-D localization of
human joints on the arms, legs, and keypoints on torso and the face.
• Definition: The degree of match between ground truth and predicted
poses is measured in terms of object keypoint similarity (OKS), which
ranges from 0 to 1.
• Definition: OKS-induced average precision (AP) metric is used to
measure the overall quality of the combined person detection and
pose estimation.
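The OKS definition above can be made concrete with a small sketch. It follows the form used in the COCO keypoint evaluation (a Gaussian falloff of the keypoint distance, averaged over labelled keypoints); the function name, argument names, and the per-keypoint constants below are illustrative assumptions, not values taken from the slides.

```python
import math

def oks(pred, gt, visible, area, kappas):
    """Object keypoint similarity between a predicted and a ground-truth pose.

    pred, gt : lists of (x, y) keypoint coordinates
    visible  : list of bools, True where the ground-truth keypoint is labelled
    area     : object scale squared (s**2), e.g. the person segment area
    kappas   : per-keypoint falloff constants (illustrative values below)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visible, kappas):
        if not v:
            continue  # unlabelled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * k ** 2))
        count += 1
    return total / count if count else 0.0

# Toy example with three keypoints (all values are made up).
gt = [(10.0, 10.0), (20.0, 30.0), (40.0, 25.0)]
vis = [True, True, True]
kap = [0.1, 0.1, 0.1]
perfect = oks(gt, gt, vis, area=900.0, kappas=kap)
shifted = oks([(x + 3.0, y) for x, y in gt], gt, vis, area=900.0, kappas=kap)
```

A perfect prediction gives an OKS of 1, and the score decays towards 0 as keypoints move away from the ground truth relative to the person's scale.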

Approach (Some Theoretical Background)
• There are two approaches to multi-person pose estimation:
1. Bottom-up: keypoint proposals are grouped together into person instances.
2. Top-down: a pose estimator is applied to the output of a bounding-box person detector.

Approach (Block Diagram)
[Block diagram with stages labelled "Person Box Detection" and "Person Pose Estimation", plus a "proposal refinement" label]

Approach (Two stage model overview)
Figure 8 The overall system

Approach (How to solve the problem?)
It consists of two stages:
1. They predict the location and scale of boxes which are likely to
contain people.
2. They estimate the keypoints of the person potentially contained in
each proposed bounding box.

Approach (Person Box Detection)
1. Faster-RCNN system
Figure 9 Faster-RCNN main architecture [7]

Approach (Person Box Detection)
1. ResNet-101 network backbone.
2. Atrous convolution to generate denser feature maps.
3. The CNN backbone was pre-trained on ImageNet.
4. The region proposal and box classifier components of the Faster-RCNN detector were trained on the COCO dataset (person class only).
5. Faster-RCNN TensorFlow implementation.

Approach (Person Pose Estimation)
• The system predicts the location of all K = 17 person keypoints.
• Input: a person bounding box (from the first stage).
• Instead of a single regressor, activation maps are used (multiple predictions of the same keypoint).
• Problem: localization precision is limited by the size of the output feature map.
[Figure: input image, feature map size, activation map, and localization precision]

• Solution: a classification + regression approach.
• For each spatial position, we first classify whether it is in the vicinity of each of the K keypoints or not (heatmap).
• We then predict a 2-D local offset vector to get a more precise estimate of the corresponding keypoint location.

Figure 10. Heatmap target for the left-elbow keypoint (left and middle). Offset field L2 magnitude (shown in grayscale) and 2-D offset vector shown in red (right).

Approach (Image Cropping)
• All boxes are made to have the same fixed aspect ratio.
• The boxes are further enlarged to include additional image context.
• The image is cropped to the resulting box and resized to a fixed crop of height 353 and width 257 pixels (aspect ratio: 1.37).
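The box adjustment described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the box is extended (never shrunk) on one side to reach the 353:257 aspect ratio, and the context factor of 1.25 is an illustrative value that does not appear on the slide.

```python
def crop_box(x0, y0, x1, y1, target_h=353, target_w=257, context=1.25):
    """Return (cx, cy, w, h) of the region to crop around a detected box.

    The centre is kept, one side is extended to reach the target aspect
    ratio, and both sides are then scaled by `context` (assumed value).
    """
    w, h = x1 - x0, y1 - y0
    cx, cy = x0 + w / 2.0, y0 + h / 2.0
    aspect = target_h / target_w          # height / width, about 1.37
    if h / w < aspect:
        h = w * aspect                    # box too wide: extend the height
    else:
        w = h / aspect                    # box too tall: extend the width
    return cx, cy, w * context, h * context

# A square 100x100 detection becomes a taller region with ratio 353:257.
cx, cy, w, h = crop_box(0.0, 0.0, 100.0, 100.0)
```

The returned region would then be cropped from the image and resized to 353x257 pixels, so the person is not distorted by the resize.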

Approach (Heatmap and Offset Prediction with CNN)
• Apply a ResNet with 101 layers to the cropped image.
• For each keypoint, generate a heatmap, an x-offset, and a y-offset.
• Transfer learning: use an ImageNet-pretrained ResNet-101 model, replacing its last layer with a 1x1 convolution with 3·K outputs (one heatmap and two offset channels per keypoint).
• Use atrous convolution to generate the 3·K predictions with an output stride of 8 pixels and bilinearly up-sample them to the 353x257 crop size.
https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d
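A quick arithmetic check of the sizes on this slide may help. The sketch assumes the common convention that an output stride of s keeps positions 0, s, 2s, ... of the input, so the feature map has (size - 1) // s + 1 entries per side; that convention and the helper name are assumptions, not something stated on the slide.

```python
def output_size(input_size, stride):
    # Assumed convention: positions 0, stride, 2*stride, ... up to input_size - 1.
    return (input_size - 1) // stride + 1

K = 17                        # number of keypoints
channels = 3 * K              # heatmap + x-offset + y-offset per keypoint
feat_h = output_size(353, 8)  # rows of the stride-8 prediction map
feat_w = output_size(257, 8)  # columns of the stride-8 prediction map
```

Under this convention the network predicts 51 channels on a 45x33 grid, which is then bilinearly up-sampled to the 353x257 crop.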

Approach (Heatmap/Offset Formula)
Generate heatmap:
• $k \in \{1, \dots, K\}$, with $K = 17$ keypoints.
• $i \in \{1, \dots, N\}$, with $N = 353 \times 257 = 90721$ positions $x_i$ in the crop.
• Ideal target: $f_k(x_i) = 1$ if keypoint $k$ is located at position $x_i$, and $f_k(x_i) = 0$ otherwise.
• Heatmap target: $h_k(x_i) = 1$ if $\|x_i - l_k\| \le R$, and $h_k(x_i) = 0$ otherwise, where $l_k$ is the location of keypoint $k$.
Generate 2-D offset vector:
• $F_k(x_i) = l_k - x_i$
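The two training targets defined on this slide (a disk-shaped heatmap of radius R around the keypoint, and an offset field pointing from each position to the keypoint) can be generated as in the sketch below. The function name and the tiny grid and radius in the example are illustrative assumptions chosen to keep the example readable.

```python
import math

def make_targets(keypoint, height, width, radius):
    """Build the disk heatmap and the offset field for one keypoint.

    keypoint : (x, y) location of the keypoint in crop coordinates
    Returns (heatmap, offsets), both indexed as [y][x].
    """
    lx, ly = keypoint
    heatmap = [[0] * width for _ in range(height)]
    offsets = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = lx - x, ly - y          # offset from position to keypoint
            offsets[y][x] = (dx, dy)
            if math.hypot(dx, dy) <= radius:
                heatmap[y][x] = 1            # inside the disk of radius R
    return heatmap, offsets

# Tiny 9x11 grid with the keypoint at x=5, y=4 and R=2.
heatmap, offsets = make_targets((5, 4), height=9, width=11, radius=2)
```

In the real system the grid would be the 353x257 crop and one such pair of targets would be built for each of the 17 keypoints.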

Approach (Final keypoint location)
• Aggregate the heatmap and offset outputs into the final keypoint location.
• Each point j in the image crop casts a vote for the location of every keypoint.
• It is a direct application of Hough voting.
[Equation annotations: final heatmap of the k-th keypoint; offset vector from point j to the k-th keypoint; approximated heatmap of the k-th keypoint; normalization based on R]
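The voting step can be sketched as below. This is an illustration under assumptions, since the slide's equation did not survive extraction: each position votes for the location it points to with a weight equal to its heatmap probability, the vote is spread bilinearly over the four surrounding pixels, and the sum is normalized by the disk area πR². The function name and the toy data are hypothetical.

```python
import math

def hough_vote(heatmap, offsets, radius):
    """Fuse heatmap probabilities and offsets into one sharp activation map."""
    height, width = len(heatmap), len(heatmap[0])
    fused = [[0.0] * width for _ in range(height)]
    norm = 1.0 / (math.pi * radius ** 2)
    for y in range(height):
        for x in range(width):
            p = heatmap[y][x]
            if p <= 0.0:
                continue
            dx, dy = offsets[y][x]
            tx, ty = x + dx, y + dy          # location this position votes for
            x0, y0 = math.floor(tx), math.floor(ty)
            # Spread the vote bilinearly over the four surrounding pixels.
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= xx < width and 0 <= yy < height:
                        w = (1 - abs(tx - xx)) * (1 - abs(ty - yy))
                        fused[yy][xx] += norm * p * w
    return fused

# Toy data: ideal disk heatmap and exact offsets for a keypoint at (3, 3).
R, kx, ky = 2, 3, 3
heatmap = [[1.0 if math.hypot(kx - x, ky - y) <= R else 0.0
            for x in range(7)] for y in range(7)]
offsets = [[(float(kx - x), float(ky - y)) for x in range(7)] for y in range(7)]
fused = hough_vote(heatmap, offsets, R)
```

With ideal inputs every vote from the disk lands on the keypoint itself, so the fused map has a single peak there, which is why the aggregated activation maps are much more localized than the raw disk heatmaps.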

Approach (Mixing the results)
Figure 11. The convolutional network predicts two targets: heatmaps
and magnitude of the offset fields. Aggregating them in a weighted
voting process results in highly localized activation maps.

Approach (Model Training)
• A single ResNet with two output heads (heatmap disks and offsets).
• Training images for heatmaps (on board)
• Loss function: $L(\theta) = \lambda_h L_h(\theta) + \lambda_o L_o(\theta)$
• The heatmap loss $L_h(\theta)$ is the sum of logistic losses for each position and keypoint separately.
• For training the offset regression head, we penalize the difference between the predicted and ground-truth offsets using the Huber robust loss; this gives the offset loss $L_o(\theta)$.
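The combined loss can be sketched for a single keypoint as follows. Several details are assumptions made for illustration: the offset loss is only accumulated inside the disk of radius R around the keypoint, the Huber threshold is 1, and all function names are hypothetical. The weights 4 and 1 are the λ values quoted later in the deck.

```python
import math

def logistic_loss(logit, target):
    # Numerically stable binary cross-entropy on a raw (pre-sigmoid) logit.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def huber(u, delta=1.0):
    # Quadratic near zero, linear for large errors (delta is an assumed value).
    return 0.5 * u * u if u <= delta else delta * (u - 0.5 * delta)

def pose_loss(logits, pred_offsets, keypoint, radius, lambda_h=4.0, lambda_o=1.0):
    """Weighted heatmap + offset loss for one keypoint on a [y][x] grid."""
    lx, ly = keypoint
    loss_h = loss_o = 0.0
    for y, row in enumerate(logits):
        for x, logit in enumerate(row):
            dx, dy = lx - x, ly - y
            inside = math.hypot(dx, dy) <= radius
            loss_h += logistic_loss(logit, 1.0 if inside else 0.0)
            if inside:  # assumption: offsets are only supervised inside the disk
                ox, oy = pred_offsets[y][x]
                loss_o += huber(math.hypot(ox - dx, oy - dy))
    return lambda_h * loss_h + lambda_o * loss_o

# Toy 5x5 grid: exact offsets, uninformative (zero) heatmap logits.
grid, kp = 5, (2, 2)
perfect = [[(float(kp[0] - x), float(kp[1] - y)) for x in range(grid)]
           for y in range(grid)]
zero_logits = [[0.0] * grid for _ in range(grid)]
baseline = pose_loss(zero_logits, perfect, kp, radius=1)
```

The full training loss would sum this quantity over all 17 keypoints and over the images in a mini-batch.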

Approach (Pose Rescoring)
• At test time, we compute a refined confidence estimate for the person in each box.
• In particular, we maximize over locations and average over keypoints, yielding the final instance-level pose detection score.
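The rescoring rule described above (maximum over locations, then average over keypoints) is short enough to write out. The function name and the toy activation maps are illustrative assumptions.

```python
def pose_score(activation_maps):
    """Average over keypoints of the maximum activation of each keypoint map.

    activation_maps : list of K maps, each indexed as [y][x]
    """
    peaks = [max(max(row) for row in amap) for amap in activation_maps]
    return sum(peaks) / len(peaks)

# Two toy 2x2 maps standing in for the K = 17 fused activation maps.
maps = [
    [[0.1, 0.9], [0.2, 0.3]],
    [[0.5, 0.4], [0.1, 0.0]],
]
score = pose_score(maps)  # average of the two peaks, 0.9 and 0.5
```

A box in which every keypoint has a strong, well-localized peak therefore receives a high score, while a box with weak or missing keypoints is ranked lower.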

Approach (OKS-Based Non Max Suppression)
• Non-maximum suppression (NMS) eliminates multiple detections in the person-detector stage.
• The standard approach measures overlap using the IoU of the boxes. What about keypoints?
• Measure the overlap of two candidate pose detections using OKS instead.
• A high IoU-NMS threshold (0.6 in our experiments) is used at the output of the person box detector to filter highly overlapping boxes.
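A greedy OKS-based NMS could look like the sketch below. It is an illustration under assumptions: poses are visited in order of decreasing score, a pose is dropped when its OKS with an already kept pose reaches the threshold, and the threshold of 0.5, the per-keypoint constants, and all names are hypothetical.

```python
import math

def oks(pose_a, pose_b, area, kappas):
    # Mean Gaussian similarity of corresponding keypoints (all treated as visible).
    sims = [math.exp(-((ax - bx) ** 2 + (ay - by) ** 2) / (2.0 * area * k ** 2))
            for (ax, ay), (bx, by), k in zip(pose_a, pose_b, kappas)]
    return sum(sims) / len(sims)

def oks_nms(poses, scores, areas, kappas, threshold=0.5):
    """Return indices of the poses kept after greedy OKS-based suppression."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep pose i only if it is not too similar to any pose kept so far.
        if all(oks(poses[i], poses[j], areas[j], kappas) < threshold for j in keep):
            keep.append(i)
    return keep

poses = [
    [(10.0, 10.0), (20.0, 20.0)],      # best-scoring pose
    [(11.0, 10.0), (20.0, 21.0)],      # near-duplicate of the first
    [(100.0, 100.0), (110.0, 110.0)],  # a different person
]
kept = oks_nms(poses, scores=[0.9, 0.8, 0.7], areas=[400.0] * 3, kappas=[0.1, 0.1])
```

In this toy case the near-duplicate is suppressed and the two distinct people survive, which is the behaviour box IoU alone cannot guarantee when two people overlap heavily.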

Evaluation (Experimental Setup)
• TensorFlow, distributed training, Tesla K40 GPUs.
• For person detector:
o 9 GPUs.
o Asynchronous SGD with momentum set to 0.9.
o The learning rate starts at 0.0003 and is decreased by a factor of 10 at 800k steps.
o Train for 1M steps.
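The detector's optimizer settings can be written down as a small sketch. It shows only the step-wise learning-rate schedule and a plain momentum update on a single scalar parameter; the asynchronous, multi-GPU part of the setup is not modelled, and the function names are hypothetical.

```python
def learning_rate(step, base_lr=0.0003, drop_step=800_000, factor=10.0):
    # Constant rate, divided by `factor` once `drop_step` is reached.
    return base_lr if step < drop_step else base_lr / factor

def momentum_step(param, velocity, grad, lr, momentum=0.9):
    # Classic SGD with momentum: accumulate a velocity, then move along it.
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

early = learning_rate(0)
late = learning_rate(900_000)
param, velocity = momentum_step(1.0, 0.0, grad=2.0, lr=0.1)
```

The schedule simply drops the rate by a factor of 10 for the last 200k of the 1M training steps.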

Evaluation (Experimental Setup)
• For pose estimator:
o Two machines with 8 GPUs each.
o Batch size equal to 24 (3 crops per GPU times 8 GPUs).
o Fixed learning rate of 0.005.
o Polyak-Ruppert parameter averaging, which amounts to evaluating with a running average of the parameters seen during training.
o Train for 800k steps.
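Parameter averaging of this kind can be illustrated with a running mean that is updated after every training step and used in place of the raw parameters at evaluation time. The class below is a minimal sketch with hypothetical names; it keeps a uniform average, whereas practical implementations often use an exponential moving average instead.

```python
class RunningAverage:
    """Uniform running mean of the parameter vectors seen during training."""

    def __init__(self, size):
        self.mean = [0.0] * size
        self.count = 0

    def update(self, params):
        # Incremental mean: new_mean = mean + (value - mean) / count.
        self.count += 1
        self.mean = [m + (p - m) / self.count for m, p in zip(self.mean, params)]

# Three toy "training steps" with a two-parameter model.
avg = RunningAverage(2)
for params in ([1.0, 4.0], [3.0, 0.0], [2.0, 2.0]):
    avg.update(params)
```

At evaluation time the model would be run with `avg.mean` rather than the last parameter vector, which smooths out the noise of the final SGD steps.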

Evaluation (Datasets)
• All networks are pre-trained on the ImageNet classification dataset.
• To train our system we use two dataset variants:
o One that uses only COCO data (COCO-only),
o One that augments COCO-only with an internal dataset (COCO + int).
• From the 66K images (273K person instances) in the COCO train + val splits, 62K images (105K person instances) are used for COCO-only model training and the remaining 4,301 annotated images are used as a mini-val evaluation set.
• COCO + int = COCO-only + 73K images from Flickr.
• This in-house dataset contains an additional 227K person instances.

Evaluation (Datasets)
• The Faster-RCNN person box detection module is trained exclusively on the COCO-only dataset.
• The ResNet-based pose estimation module is trained either on the COCO-only or on the augmented COCO + int dataset, and results are presented for both.
• For COCO + int pose training we use mini-batches that contain COCO and in-house annotation instances in a 1:1 ratio.

Evaluation (COCO Keypoints Detection State-of-the-Art)

Evaluation (Results Visualization)
Figure 12. Results visualization of randomly picked image from COCO test-dev set

Ablation Study (Box detection and pose estimation)

Ablation Study (OKS Non Maximum Suppression)

Weaknesses
• They have not open-sourced the system yet (even though they promised to!)
• The output of the first head passes through a sigmoid function (heatmap probabilities)
• $\lambda_h = 4$ and $\lambda_o = 1$ are scalar factors that balance the loss function terms (why these values?)

Strengths
• Multi-person
• Cluttered environments
• People near each other
• High accuracy
• Reproducibility

Discussion
Feel free to ask any question

REFERENCES
[1] http://news.mit.edu/2011/mars-rover-0725
[2] www.forbes.com, static.independent.co.uk and fortune.com
[3] M. A. Fischler and R. Elschlager. The representation and matching of pictorial structures. In IEEE TOC, 1973.
[4] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
[5] A. Jain, J. Tompson, M. Andriluka, G. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. In ICLR, 2014.
[6] M. A. Fischler and R. Elschlager. The representation and matching of pictorial structures. In IEEE TOC, 1973.
[7] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

Extra Slides (IoU, OKS)
http://image-net.org/challenges/talks/2016/ECCV2016_workshop_presentation_keypoint.pdf

Extra Slides (SGDM)
http://ruder.io/optimizing-gradient-descent/

What is AP, AP.5, AP.75?
http://cocodataset.org/#detection-eval
