Towards Accurate Multi-person Pose Estimation in the Wild (My summary)
1. HACETTEPE UNIVERSITY
Presenter: A. Haje Karim
Date: 31.12.2018
George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev,
Jonathan Tompson, Chris Bregler, Kevin Murphy
Google, Inc.
3. Motivation (Why computer vision?)
Figure 1 Human brain lobes (left) and evolution of eye (right)
https://qbi.uq.edu.au/brain/brain-anatomy/lobes-brain
https://www.pinterest.com/pin/565905509398223146/
4. Motivation (Why computer vision?)
Figure 2. Mars rover Curiosity
“When the next-generation Mars
rover, dubbed Curiosity, touches
down on martian soil next
summer, its cameras will likely
capture a scene similar to what
the first explorers of the Grand
Canyon witnessed: towering
layers of rock and sediment
rising up from a dusty
valley…..”[1]
http://news.mit.edu/2011/mars-rover-0725
5. Motivation (Why pose estimation?)
Figure 3 Computer vision and its direct relation to humans[2].
SurvHuman : https://static.independent.co.uk/s3fs-public/thumbnails/image/2013/01/01/20/pg-20-cctv-getty.jpg?w968h681
autCar: http://fortune.com/2016/08/30/self-driving-drive-ai/
factRobHum: https://www.forbes.com/sites/alexknapp/2015/05/06/how-businesses-are-teaching-robots-new-tricks/#47cca6232859
6. Motivation (Why this paper?)
• No prior knowledge of the location and scale
• Multi-person and crowded scene
Why is this paper novel and critical?
7. Problem Definition (What is the problem?)
We would like to:
1. Detect many people.
2. Estimate the pose of each person.
Figure 4 An image taken from COCO 2018 Keypoint Detection Task
8. Problem Definition (Clearly)
Estimate the 2-D locations of human joints on the arms, legs, and keypoints on the torso and the face.
The method should work when there are many people in the scene.
The method should also work in a cluttered environment.
9. Previous Work
• A. Toshev and C. Szegedy. DeepPose: Human pose estimation via
deep neural networks. In CVPR, 2014
• They mainly used a deep CNN.
• As you can see, it does not work for multiple people!
Figure 5 Pose estimation results on images from LSP [4]
10. Previous Work
• A. Jain, J. Tompson, M. Andriluka, G. Taylor, and C. Bregler. Learning
human pose estimation features with convolutional networks. In ICLR,
2014
• For every body part, a network scores whether that part is present in a given image patch.
Figure 6 CNN structure used in this work [5]
11. Previous Work
1. The networks perform end-to-end feature learning and are trained with the back-propagation algorithm.
2. The input is a 64×64 pixel RGB patch that has been local contrast normalized (LCN) to emphasize geometric discontinuities and improve generalization performance.
3. The input is then processed by three convolution and subsampling layers, which use rectified linear units (ReLUs) and max-pooling.
4. Following the three stages of convolution and subsampling, the top-level pooled map is flattened to a vector and processed by three fully connected layers, analogous to those used in deep neural networks.
5. The output layer has a single logistic unit, representing the probability of the body part being present in that patch.
6. To train the convnet, the authors performed standard batch stochastic gradient descent.
7. From the training set images, a validation set was set aside to tune the network hyper-parameters, such as the number and size of features, learning rate, etc.
12. Previous Work
M. A. Fischler and R. Elschlager. The representation and matching of
pictorial structures. In IEEE TOC, 1973.
Quite an old paper, but it cannot be skipped!
The research in human pose estimation has been based on the idea of
part-based models, as pioneered by the Pictorial Structures (PS) model.
The majority of these methods focus on capturing rich dependencies
among body parts and properties.
13. Previous Work
• Figure 7 Reference description of a face and schematic representation
of face reference, indicating components and their linkages [6]
The representation and matching of pictorial structures.
14. Approach (Some Theoretical Background)
• Definition: Human Pose Estimation is defined as 2-D localization of
human joints on the arms, legs, and keypoints on torso and the face.
• Definition: The degree of match between ground truth and predicted
poses is measured in terms of object keypoint similarity (OKS), which
ranges from 0 to 1.
• Definition: OKS-induced average precision (AP) metric is used to
measure the overall quality of the combined person detection and
pose estimation.
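As a rough illustration of the OKS definition, the sketch below follows the publicly documented COCO formula: each labeled keypoint contributes exp(-d^2 / (2 s^2 k^2)), and the contributions are averaged. The function name `oks`, the argument names, and the choice to return 0 when no keypoint is labeled are assumptions made for this example, not part of the slides.

```python
import numpy as np

def oks(pred, gt, labeled, area, kappas):
    """Object keypoint similarity between a predicted and a ground-truth pose.

    pred, gt : (K, 2) arrays of (x, y) keypoint positions
    labeled  : (K,) boolean array, True where the ground truth is annotated
    area     : object scale squared (s^2), e.g. the ground-truth segment area
    kappas   : (K,) per-keypoint falloff constants (assumed given)
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    kappas = np.asarray(kappas, dtype=float)
    if not labeled.any():
        return 0.0  # assumption: no labeled keypoints means no similarity
    d2 = np.sum((pred - gt) ** 2, axis=1)            # squared distances
    sim = np.exp(-d2 / (2.0 * area * kappas ** 2))   # per-keypoint similarity
    return float(sim[labeled].mean())                # average over labeled ones
```

Identical poses give an OKS of 1, and the value decays towards 0 as the predicted keypoints move away from the ground truth relative to the object scale.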
15. Approach (Some Theoretical Background)
• To tackle multi-person pose estimation, there are two approaches:
1. Bottom-up: keypoint proposals are grouped together into person instances.
2. Top-down: a pose estimator is applied to the output of a bounding-box person detector.
18. Approach (How to solve the problem?)
It consists of two stages:
1. They predict the location and scale of boxes which are likely to
contain people.
2. They estimate the keypoints of the person potentially contained in
each proposed bounding box.
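The two stages can be sketched as a small driver function. This is only a skeleton under stated assumptions: `detect_people` and `estimate_pose` are hypothetical callables standing in for the person detector and the pose estimator, and the 0.3 box-score cutoff is the value quoted in the speaker notes.

```python
def top_down_pose(image, detect_people, estimate_pose, min_box_score=0.3):
    """Two-stage (top-down) pose estimation skeleton.

    detect_people(image)      -> iterable of (box, box_score)   (hypothetical)
    estimate_pose(image, box) -> (keypoints, pose_score)        (hypothetical)
    """
    results = []
    for box, box_score in detect_people(image):
        if box_score <= min_box_score:
            continue  # stage 1: keep only confident person boxes
        keypoints, pose_score = estimate_pose(image, box)  # stage 2
        results.append((box, keypoints, pose_score))
    return results
```

Passing the two stages in as callables keeps the sketch independent of any particular detector or network implementation.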
19. Approach (Person Box Detection)
1. Faster-RCNN system
Figure 9 Faster-RCNN main architecture [7]
20. Approach (Person Box Detection)
1. ResNet-101 network backbone.
2. Atrous convolution to generate denser feature maps.
3. CNN backbone was pre-trained on ImageNet.
4. Region proposal and box classifier components of the Faster-RCNN
detector have been trained on COCO dataset (only person)
5. Faster-RCNN TensorFlow implementation
21. Approach (Person Pose Estimation)
• System predicts the location of all K = 17 person keypoints
• Input : person bounding box. (from first stage)
• Do not use a single regressor per keypoint; use activation maps instead (they allow multiple predictions of the same keypoint).
• Problem with localization precision
Diagram labels: Input image, Feature map size, Activation map, Localization precision
22. Approach (Person Pose Estimation)
• Solution : classification + regression approach.
• For each spatial position, we first classify whether it is in the vicinity
of each of the K keypoints or not (heatmap)
• Predict a 2-D local offset vector to get a more precise estimate of the
corresponding keypoint location.
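The classification and regression targets described above can be sketched for a single keypoint as follows. The function name `make_targets` and the (dx, dy) channel order are assumptions for illustration; the disk radius is left as a parameter.

```python
import numpy as np

def make_targets(keypoint_xy, height, width, radius):
    """Build the heatmap and offset targets for one keypoint.

    Returns
    heatmap : (H, W) array, 1 inside the disk of the given radius, else 0
    offsets : (H, W, 2) array of (dx, dy) vectors pointing to the keypoint
    """
    ys, xs = np.mgrid[0:height, 0:width]
    dx = keypoint_xy[0] - xs            # horizontal offset to the keypoint
    dy = keypoint_xy[1] - ys            # vertical offset to the keypoint
    heatmap = (np.hypot(dx, dy) <= radius).astype(np.float32)
    offsets = np.stack([dx, dy], axis=-1).astype(np.float32)
    return heatmap, offsets
```

Adding the offset at any position to that position's coordinates recovers the keypoint location exactly, which is what lets the coarse heatmap be refined to a precise estimate.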
23. Approach (Person Pose Estimation)
Figure 10. Heatmap target for the left-elbow keypoint (left and middle).
Offset field L2 magnitude (shown in grayscale) and 2-D offset vector
shown in red (right).
24. Approach (Image Cropping)
• Make all boxes have the same fixed aspect ratio.
• Further enlarge the boxes to include additional image context.
• Crop the image to the resulting box and resize it to a fixed crop of
height 353 and width 257 pixels (aspect ratio: 1.37).
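A minimal sketch of the box adjustment, assuming boxes are given as corner coordinates and that the enlargement keeps the box center fixed. The name `crop_box` is hypothetical, and the default `scale=1.25` is the evaluation-time rescaling factor mentioned in the speaker notes.

```python
def crop_box(x0, y0, x1, y1, aspect=353 / 257, scale=1.25):
    """Extend a detector box to a fixed height/width ratio, then enlarge it.

    aspect : target height / width ratio of the crop
    scale  : enlargement factor that adds image context around the person
    """
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = x1 - x0, y1 - y0
    if h / w < aspect:
        h = w * aspect      # box too wide: extend the height
    else:
        w = h / aspect      # box too tall: extend the width
    w, h = w * scale, h * scale
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

Because the box is only extended, never squeezed, the later resize to 353x257 does not distort the person. The returned box may reach beyond the image border, so a real pipeline would need some padding policy, which is not covered here.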
25. Approach (Heatmap and Offset Prediction with CNN)
• Apply ResNet with 101 layers on the cropped image
• For each keypoint generate heatmap, x-offset and y-offset.
• Transfer learning using an ImageNet-pretrained ResNet-101 model,
replacing its last layer with a 1x1 convolution with 3K outputs.
• Use atrous convolution to generate the 3K predictions with an output
stride of 8 pixels and bilinearly up-sample them to the 353x257 crop
size.
https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d
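To make the idea of atrous (dilated) convolution concrete, here is a 1-D toy version: the kernel taps are spaced `rate` samples apart, so the receptive field grows without any subsampling of the signal. The function name and the "valid" boundary handling are choices made for this sketch only; the network itself uses 2-D atrous convolutions.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D correlation whose kernel taps are `rate` samples apart."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    span = (len(kernel) - 1) * rate + 1   # receptive field of one output
    n_out = len(signal) - span + 1
    return np.array([np.dot(kernel, signal[i:i + span:rate])
                     for i in range(n_out)])
```

With `rate=1` this is an ordinary convolution layer (without kernel flipping). With `rate=2` the same three-tap kernel covers five input samples while still producing one output per input position, which is how a denser feature map is obtained.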
27. Approach (Final keypoint location)
• Aggregate the predictions into the final keypoint locations.
• Each point j in the image crop casts its vote for the location of
every keypoint.
• It is a direct application of Hough voting.
Equation labels:
• Final heatmap of the k-th keypoint
• Offset vector from point j to the k-th keypoint
• Approximated heatmap of the k-th keypoint
• Normalize based on R
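The voting step can be sketched as below. As recalled from the paper, the aggregation has the form $f_k(x_i) = \frac{1}{\pi R^2} \sum_j G(x_j + F_k(x_j) - x_i)\, h_k(x_j)$, where $G$ is the bilinear interpolation kernel; the code is one possible NumPy realization for a single keypoint, with `hough_vote` and the (dx, dy) offset layout being assumptions of this example.

```python
import numpy as np

def hough_vote(heatmap, offsets, radius):
    """Aggregate per-pixel votes into a sharp activation map.

    heatmap : (H, W) probabilities of being near the keypoint
    offsets : (H, W, 2) predicted (dx, dy) from each pixel to the keypoint
    radius  : disk radius R used for the 1 / (pi R^2) normalization
    """
    H, W = heatmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    tx = (xs + offsets[..., 0]).ravel()        # where each pixel votes (x)
    ty = (ys + offsets[..., 1]).ravel()        # where each pixel votes (y)
    wts = heatmap.ravel() / (np.pi * radius ** 2)
    x0 = np.floor(tx).astype(int)
    y0 = np.floor(ty).astype(int)
    fx, fy = tx - x0, ty - y0
    out = np.zeros((H, W))
    # Bilinear kernel: spread every vote over its four nearest grid cells.
    corners = ((0, 0, (1 - fx) * (1 - fy)), (0, 1, fx * (1 - fy)),
               (1, 0, (1 - fx) * fy), (1, 1, fx * fy))
    for dy_, dx_, w in corners:
        xi, yi = x0 + dx_, y0 + dy_
        ok = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
        np.add.at(out, (yi[ok], xi[ok]), (wts * w)[ok])
    return out
```

With perfect heatmaps and offsets, all votes from inside the disk land on the keypoint, so the map has a single peak whose mass is close to 1 (exactly the pixel count of the disk divided by πR²).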
28. Approach (Mixing the results)
Figure 11. The convolutional network predicts two targets: heatmaps
and magnitude of the offset fields. Aggregating them in a weighted
voting process results in highly localized activation maps.
29. Approach (Model Training)
• Single ResNet with two output heads (heatmap disks and offsets).
• Training images for heatmaps (on board)
• Loss function: $L(\theta) = \lambda_h L_h(\theta) + \lambda_o L_o(\theta)$
• The heatmap loss $L_h(\theta)$ is the sum of logistic losses
for each position and keypoint separately.
• For training the offset regression head, we penalize the difference between
the predicted and ground-truth offsets using the Huber robust loss; this
gives the offset loss $L_o(\theta)$.
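A sketch of the two-term loss, using the weights 4 and 1 quoted on the Weaknesses slide. Several details are assumptions of this example rather than facts from the slides: the Huber threshold `delta=1.0`, applying the offset loss only inside the keypoint disk (as recalled from the paper), and the function names.

```python
import numpy as np

def logistic_loss(logits, targets):
    """Sum of per-position logistic (sigmoid cross-entropy) losses."""
    z = np.asarray(logits, dtype=float)
    t = np.asarray(targets, dtype=float)
    # Numerically stable form of -t*log(s(z)) - (1-t)*log(1-s(z)).
    return np.sum(np.maximum(z, 0) - z * t + np.log1p(np.exp(-np.abs(z))))

def huber(u, delta=1.0):
    """Huber robust loss: quadratic near zero, linear for large errors."""
    a = np.abs(u)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def pose_loss(logits, pred_offsets, heat_target, offset_target,
              lam_h=4.0, lam_o=1.0):
    """Weighted sum of the heatmap loss and the offset loss."""
    l_h = logistic_loss(logits, heat_target)
    err = np.linalg.norm(pred_offsets - offset_target, axis=-1)
    l_o = np.sum(huber(err) * heat_target)   # assumed: only inside the disk
    return lam_h * l_h + lam_o * l_o
```

The logistic term treats every position as an independent binary classification, while the Huber term keeps a few badly wrong offsets from dominating the gradient.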
30. Approach (Pose Rescoring)
• At test time, we compute a refined confidence estimate for the person
detected in each box.
• In particular, we maximize over locations and average over keypoints,
yielding our final instance-level pose detection score.
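Read literally, "maximize over locations and average over keypoints" corresponds to $\text{score} = \frac{1}{K} \sum_{k=1}^{K} \max_{x_i} f_k(x_i)$. The sketch below computes that quantity from a stack of activation maps; the function name and the (K, H, W) array layout are assumptions for this example.

```python
import numpy as np

def pose_score(activation_maps):
    """Instance-level score: mean over keypoints of each map's maximum.

    activation_maps : (K, H, W) array, one activation map per keypoint
    """
    maps = np.asarray(activation_maps, dtype=float)
    peaks = maps.reshape(maps.shape[0], -1).max(axis=1)  # max over locations
    return float(peaks.mean())                           # mean over keypoints
```

A box in which only a few keypoints are found confidently therefore gets a lower score than one where all keypoints have strong peaks.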
31. Approach (OKS-Based Non Max Suppression)
• Non-maximum suppression (NMS) is used to eliminate multiple detections in
the person-detector stage.
• The standard approach measures overlap using the IoU of the boxes. What about keypoints?
• Measure the overlap of two candidate pose detections using OKS.
• A relatively high IoU-NMS threshold (0.6 in our experiments) is used at the
output of the person box detector to filter highly overlapping boxes.
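A greedy NMS loop with a pluggable similarity measure might look like the sketch below. The `similarity` callable is a stand-in for an OKS function between two poses, and both the function name and the threshold handling are assumptions; the slides do not state the OKS threshold, so it is left as a parameter.

```python
def oks_nms(poses, scores, similarity, threshold):
    """Greedy non-maximum suppression over pose detections.

    poses      : list of pose candidates (any representation)
    scores     : list of confidence scores, one per pose
    similarity : callable(pose_a, pose_b) -> value in [0, 1], e.g. OKS
    threshold  : a candidate is dropped if it is more similar than this
                 to an already kept, higher-scoring pose
    Returns the indices of the kept poses, best first.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(similarity(poses[i], poses[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```

Swapping box IoU for OKS as the similarity makes the suppression depend on how similar the estimated poses are rather than on how much the boxes overlap, which matters when two people stand close together.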
32. Evaluation (Experimental Setup)
• Tensorflow, distributed training, Tesla K40 GPUs.
• For person detector:
o 9 GPUs.
o Asynchronous SGD with momentum set to 0.9.
o The learning rate starts at 0.0003 and is decreased by a factor of 10 at
800k steps.
o Train for 1M steps.
33. Evaluation (Experimental Setup)
• For pose estimator:
o Two machines with 8 GPUs each.
o Batch size equal to 24 (3 crops per GPU times 8 GPUs).
o Fixed learning rate of 0.005.
o Polyak-Ruppert parameter averaging, which amounts to using a running
average of the training parameters during evaluation.
o Train for 800k steps.
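Polyak-Ruppert averaging can be sketched as an incremental mean of the parameter vectors seen during training. The class name is hypothetical, and this is the plain uniform average; practical systems often use an exponential moving average instead, which is not shown here.

```python
import numpy as np

class PolyakAverager:
    """Running (uniform) average of parameter vectors seen during training."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, params):
        """Fold the current parameters into the average and return it."""
        params = np.asarray(params, dtype=float)
        self.count += 1
        if self.mean is None:
            self.mean = params.copy()
        else:
            # Incremental mean: m_t = m_{t-1} + (theta_t - m_{t-1}) / t
            self.mean += (params - self.mean) / self.count
        return self.mean
```

Training continues with the raw parameters; only evaluation uses the averaged copy, which smooths out the noise of the last SGD steps.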
34. Evaluation (Datasets)
• All networks are pre-trained on the ImageNet classification dataset.
• To train our system we use two dataset variants:
o One that uses only COCO data (COCO-only),
o One that augments COCO-only with an internal dataset (COCO + int).
• From the 66K images (273K person instances) in the COCO train + val
splits, 62K images (105K person instances) are used for COCO-only model
training, and the remaining 4,301 annotated images serve as the mini-val
evaluation set.
• Our COCO + int = COCO-only + 73K images from Flickr.
• This in-house dataset contains an additional 227K person instances.
35. Evaluation (Datasets)
• The Faster-RCNN person box detection module is trained exclusively on the
COCO-only dataset.
• We experimented with training the ResNet-based pose estimation module on
either the COCO-only or the augmented COCO + int dataset, and present
results for both.
• For COCO + int pose training we use mini-batches that contain COCO
and in-house annotation instances in a 1:1 ratio.
42. Weaknesses
• They have not open-sourced the system yet (even though they promised to!)
• The output of the first head passes through a sigmoid function
(heatmap probabilities)
• $\lambda_h = 4$ and $\lambda_o = 1$ are scalar factors that balance the
loss function terms (why these values?)
46. REFERENCES
[1] http://news.mit.edu/2011/mars-rover-0725
[2] www.forbes.com, static.independent.co.uk and fortune.com
[3] M. A. Fischler and R. Elschlager. The representation and matching of
pictorial structures. In IEEE TOC, 1973.
[4] A. Toshev and C. Szegedy. Deep pose: Human pose estimation via deep
neural networks. In CVPR, 2014
[5] A. Jain, J. Tompson, M. Andriluka, G. Taylor, and C. Bregler. Learning
human pose estimation features with convolutional networks. In ICLR, 2014
[6] M. A. Fischler and R. Elschlager. The representation and matching of
pictorial structures. In IEEE TOC, 1973.
[7] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards
Real-Time object detection with region proposal networks.
In NIPS, 2015.
48. Extra Slides (IoU, OKS)
http://image-net.org/challenges/talks/2016/ECCV2016_workshop_presentation_keypoint.pdf
50. What is AP, AP.5, AP.75,
http://cocodataset.org/#detection-eval
Editor's Notes
My name is Abdulrahman Haje Karim, an M.Sc. student in the Computer Engineering department. My main interest is the intersection of the CV, CG and deep learning fields.
I will be presenting a brilliant work done by a Google research group. The work is titled “Towards Accurate Multi-person Pose Estimation in the Wild”.
First of all, let us start with an introduction covering my own and the authors' motivation; then let us examine the problem at hand. After that, let us study in detail how they solved the problem.
Then let us discuss the results they achieved, and let me conclude with my point of view, including the strengths and weaknesses of this work.
Finally, let us open a slot for your questions, suggestions and points of view regarding this research paper.
Well, vision goes back around 500 million years, when the first creatures on earth started to develop their visual systems to better hide from enemies and search for food. Refer to the introduction to deep learning by Fei-Fei Li, Stanford University.
What is more interesting is that “half of the human brain is devoted directly or indirectly to vision”, says Professor Mriganka Sur of the MIT Department of Brain and Cognitive Sciences.
Maybe you were surprised to see that strange photo from Mars while talking about our paper. You are right, but let me share with you an article that talks about the exploration of Mars. You can see in the first three lines that the main concern of scientists is that we will be able to get a visual representation of that new planet, which will lead us to more knowledge, assumptions and predictions.
Obviously, a lot of real-world problems have started to find solutions with the help of computer vision. However, in most of these problems a human is present, which is why computer vision scientists care about humans the most!
Human detection and pose estimation play a central role in many computer vision applications.
Some examples are human-robot interaction (HRI), video surveillance and self-driving cars.
Now we understand the importance of computer vision and its direct relation with humans
Why select this paper?
It is a hot topic in both virtual and augmented reality.
Many papers try to solve the pose estimation problem under the assumption that the location and/or scale of the person instances is given as ground truth, which simplifies the problem but limits the generality of the solution.
Many papers tried to solve the problem for a single person or for scenes where people are far from each other. However, this paper tackles the problem when people are close to each other, which makes it quite difficult to solve the association problem of determining which body part belongs to which person.
We would like to detect many people and to perform pose estimation on them.
Most previous work, as mentioned before, focused on a single person in the scene, so those algorithms fail in situations where many people are in the scene.
LSP :
The Leeds Sports Pose dataset contains 2000 pose annotated images of mostly sports people gathered from Flickr using the tags shown above. The images have been scaled such that the most prominent person is roughly 150 pixels in length. Each image has been annotated with 14 joint locations. Left and right joints are consistently labelled from a person-centric viewpoint
1. Each part represents local visual properties.
2. “Springs” capture spatial relationships.
3. Matching model to image involves joint optimization of part locations “stretch and fit”
OKS can be thought of as the keypoint counterpart of Intersection over Union (IoU): 0 means no match and 1 means an exact match.
Bottom-up : Detect body parts instead of full persons, then subsequently associate these parts to human instances, thus performing pose estimation in a bottom up fashion.
Such approaches employ part detectors and differ in how associations among parts are expressed, and the inference procedure used to obtain full part groupings into person instances.
Top-down : First perform person detection, then make the pose estimation.
The second one is going to be used in our paper.
First Stage : we employ a Faster-RCNN person detector to produce a bounding box around each candidate person instance.(proposal)
Second Stage : we apply a pose estimator to the image crop extracted around each candidate person instance in order to localize its keypoints and re-score the corresponding proposal. (a refinement )
(i) Go beyond bounding boxes and predict keypoints and
(ii) Rescore the detection based on the estimated keypoints.
We feed to the second stage only the person box detection proposals with a score higher than 0.3, resulting in only 3.5 proposals per image on average.
We slide the RPN (Region Proposal Networks) over the feature map and we generate k-proposals (with different sizes/aspect ratio ) with a score.
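If the IoU of an anchor box with a ground-truth box is greater than 0.7, or it is the highest for that ground-truth box, the anchor is labeled positive.
If its IoU is less than 0.3 for all ground-truth boxes, the anchor is labeled negative.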
Each sliding window is mapped to a lower-dimensional vector (256-d for ZF and 512-d for VGG).
This vector is fed into two sibling fully-connected layers—a box-regression layer (reg) and a box-classification layer (cls).
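In the Faster R-CNN paper [7], the 0.7 and 0.3 thresholds are applied to an anchor's IoU overlap with the ground-truth boxes when assigning RPN training labels. The sketch below shows the IoU computation and a simplified labeling rule; the function names, the corner-coordinate box format and the -1 "ignored" label are assumptions of this example, and the extra rule that marks the highest-IoU anchor of each ground-truth box as positive is left out for brevity.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, hi=0.7, lo=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored during training)."""
    best = max((iou(anchor, g) for g in gt_boxes), default=0.0)
    if best > hi:
        return 1
    if best < lo:
        return 0
    return -1
```

Anchors whose best overlap falls between the two thresholds contribute nothing to the RPN loss, which keeps ambiguous examples out of training.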
Atrous convolution to generate denser feature maps with output stride equal to 8 pixels instead of the default 32 pixels.
Only the person category is used; the box annotations for the remaining 79 COCO categories have been ignored.
Solution to multi-person:
One approach would be to use a single regressor per keypoint, as in [45], but this is problematic when there is more than one person in the image patch (in which case a keypoint can occur in multiple places). A logical solution is to predict activation maps, as in [27], which allow for multiple predictions of the same keypoint.
However, the size of the activation maps, and thus the localization precision, is limited by the size of the net’s output feature maps, which is a fraction of the input image size.
We use the accurate ResNet-101 (353x257) pose estimator with disk radius R = 25 pixels in the rest of the paper.
Make all boxes returned by the person detector have the same fixed aspect ratio by extending either the height or the width, without distorting the image aspect ratio.
We use a rescaling factor equal to 1.25 during evaluation and a random rescaling factor between 1.0 and 1.5 during training (for data augmentation).
Aspect ratio = (353/257) = 1.37
The network is applied in a fully convolutional fashion to produce heatmaps (one channel per keypoint) and offsets (two channels per keypoint, for the x- and y-directions), for a total of 3K output channels, where K = 17 is the number of keypoints.
Atrous to generate denser feature map.
k is the index of the keypoint.
Very hard to train, ok then let us approximate it.
This is a form of Hough voting: each point j in the image crop grid casts a vote with its estimate for the position of every keypoint.
Each vote is weighted by the probability that it is in the disk of influence of the corresponding keypoint.
The normalizing factor equals the area of the disk and ensures that if the heatmaps and offsets were perfect, then $f_k(x_i)$ would be a unit-mass delta function centered at the position of the k-th keypoint.
G(.) is the bilinear interpolation kernel
Classifier and regressor: a classifier for the heatmaps $h_k(x_i)$ and a regressor for the offsets.
The training target for heatmaps is a map of zeros and ones: ones if you are in the vicinity of the keypoint and zeros otherwise.
$\lambda_h = 4$ and $\lambda_o = 1$ are scalar factors that balance the loss function terms.
Logistic loss is a smooth, logarithmic counterpart of the hinge loss.
$F_k(x_i)$ is the predicted offset and $l_k - x_i$ is the ground-truth offset.
When computing the heatmap loss, we treat as positives only the disks around the keypoints of the foreground person and as negatives everything else, forcing the model to predict correctly the keypoints of the person in the center of the box.
At test time, rather than just relying on the confidence from the person detector, we compute a refined confidence estimate, which takes into account the confidence of each keypoint.
Average the confidences of all estimated keypoints.
We have implemented our system in TensorFlow.
http://ruder.io/optimizing-gradient-descent/
Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. gamma<1). The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
Table 1 shows the COCO keypoint test-dev split performance of our system trained on COCO-only or trained on COCO+int datasets.
Table 2 shows the COCO keypoint test-standard split results of our model with the pose estimator trained on either COCO-only or COCO+int training set.
For simplicity and to facilitate reproducibility we do not utilize multi-scale evaluation or model ensembling in the Faster-RCNN person box detection stage. Using such
enhancements can further improve our results at the cost of significantly increased computation time.
Gamma is the momentum term.
Eta is the learning rate.
Theta is the parameter vector.
J(theta) is the objective function.
SGD: performs a parameter update for each training example.
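Using the symbols above, the momentum update described in the referenced blog post is $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$ followed by $\theta = \theta - v_t$. A minimal sketch, assuming a user-supplied gradient function; the name `sgd_momentum`, the default values and the toy objective in the usage note are illustrative only.

```python
import numpy as np

def sgd_momentum(grad_fn, theta, lr=0.1, gamma=0.9, steps=400):
    """Gradient descent with momentum.

    grad_fn : callable returning the gradient of J at theta
    theta   : initial parameter vector
    lr      : learning rate (eta)
    gamma   : momentum term
    """
    theta = np.asarray(theta, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + lr * grad_fn(theta)   # accumulate velocity
        theta = theta - v                     # move along the velocity
    return theta
```

For example, with J(theta) = 0.5 * ||theta||^2 the gradient is simply theta, so `sgd_momentum(lambda t: t, [1.0, -2.0])` moves the parameters towards the minimum at the origin.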