Towards Accurate Multi-person Pose Estimation in the Wild
HACETTEPE UNIVERSITY
Presenter: A. Haje Karim
Date: 31.12.2018
George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev,
Jonathan Tompson, Chris Bregler, Kevin Murphy
Google, Inc.
Objectives
• Motivation
• Problem Definition
• Approach
• Evaluation
• Strengths and Weaknesses
• Discussion
• References
Motivation (Why computer vision?)
Figure 1 Human brain lobes (left) and evolution of the eye (right)
https://qbi.uq.edu.au/brain/brain-anatomy/lobes-brain
https://www.pinterest.com/pin/565905509398223146/
Motivation (Why computer vision?)
Figure 2. Mars rover Curiosity
“When the next-generation Mars rover, dubbed Curiosity, touches down on martian soil next summer, its cameras will likely capture a scene similar to what the first explorers of the Grand Canyon witnessed: towering layers of rock and sediment rising up from a dusty valley…” [1]
http://news.mit.edu/2011/mars-rover-0725
Motivation (Why pose estimation?)
Figure 3 Computer vision and its direct relation to humans[2].
SurvHuman : https://static.independent.co.uk/s3fs-public/thumbnails/image/2013/01/01/20/pg-20-cctv-getty.jpg?w968h681
autCar: http://fortune.com/2016/08/30/self-driving-drive-ai/
factRobHum: https://www.forbes.com/sites/alexknapp/2015/05/06/how-businesses-are-teaching-robots-new-tricks/#47cca6232859
Motivation (Why this paper?)
Why is this paper novel and critical?
• It needs no prior knowledge of the location and scale of the people.
• It handles multi-person and crowded scenes.
Problem Definition (What is the problem?)
We would like to:
1. Detect many people.
2. Estimate the pose of each detected person.
Problem Definition (Clearly)
Figure 4 An image taken from the COCO 2018 Keypoint Detection Task
Estimate the 2-D locations of human joints on the arms, legs, and keypoints on the torso and the face.
The method should work when there are many people in the scene.
The method should also work in a cluttered environment.
Previous Work
• A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
• They mainly used a deep CNN.
• As you can see, it does not work in the multi-person case!
Figure 5 Pose estimation results on images from LSP [4]
• A. Jain, J. Tompson, M. Andriluka, G. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. In ICLR, 2014.
• For every feature, compute a score based on whether it is present in the region or not.
Figure 6 CNN structure used in this work [5]
1. The network performs end-to-end feature learning and is trained with the back-propagation algorithm.
2. It starts with a 64×64 pixel RGB input patch that has been local contrast normalized (LCN) to emphasize geometric discontinuities and improve generalization performance.
3. The input is then processed by three convolution and subsampling layers, which use rectified linear units (ReLUs) and max-pooling.
4. Following the three stages of convolution and subsampling, the top-level pooled map is flattened to a vector and processed by three fully connected layers, analogous to those used in deep neural networks.
5. The output layer has a single logistic unit, representing the probability of the body part being present in that patch.
6. To train the convnet, the authors performed standard batch stochastic gradient descent.
7. From the training set images, they set aside a validation set to tune the network hyper-parameters, such as the number and size of features, the learning rate, etc.
• M. A. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, 1973.
• Quite an old paper, but it cannot be skipped!
• Research in human pose estimation has long been based on the idea of part-based models, as pioneered by the Pictorial Structures (PS) model.
• The majority of these methods focus on capturing rich dependencies among body parts and properties.
Figure 7 Reference description of a face and schematic representation of the face reference, indicating components and their linkages, from "The representation and matching of pictorial structures" [6]
Approach (Some Theoretical Background)
• Definition: Human pose estimation is the 2-D localization of human joints on the arms and legs, and of keypoints on the torso and the face.
• Definition: The degree of match between ground-truth and predicted poses is measured in terms of object keypoint similarity (OKS), which ranges from 0 to 1.
• Definition: The OKS-induced average precision (AP) metric is used to measure the overall quality of the combined person detection and pose estimation.
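To make the OKS definition concrete, here is a minimal sketch that follows the COCO keypoint evaluation formula, OKS = Σ_i exp(−d_i² / (2 s² κ_i²)) δ(v_i > 0) / Σ_i δ(v_i > 0). The function name `oks`, its argument layout, and the κ values used in the usage note are illustrative assumptions; real evaluations use the per-keypoint constants published with COCO.

```python
import math

def oks(gt, pred, visible, area, kappas):
    """Object keypoint similarity between one ground-truth and one predicted pose.

    gt, pred : lists of (x, y) keypoint coordinates, in the same order
    visible  : per-keypoint visibility flags (0 means "not labeled")
    area     : object scale squared (s^2), e.g. the person segment area
    kappas   : per-keypoint falloff constants (assumed values, see lead-in)
    """
    total, count = 0.0, 0
    for (gx, gy), (px, py), v, kappa in zip(gt, pred, visible, kappas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (gx - px) ** 2 + (gy - py) ** 2
        total += math.exp(-d2 / (2.0 * area * kappa ** 2))
        count += 1
    return total / count if count else 0.0
```

For example, `oks(pose, pose, [2] * len(pose), area, kappas)` gives 1.0 for any pose compared with itself, and the value decays towards 0 as the predicted keypoints move away from the ground truth.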
• To tackle multi-person pose estimation, there are two approaches:
1. Bottom-up: keypoint proposals are grouped together into person instances.
2. Top-down: a pose estimator is applied to the output of a bounding-box person detector.
Approach (Block Diagram)
Block diagram: Person Box Detection → Person Pose Estimation (label: proposal refinement)
Approach (Two stage model overview)
Figure 8 The overall system
Approach (How to solve the problem?)
It consists of two stages:
1. They predict the location and scale of boxes which are likely to contain people.
2. They estimate the keypoints of the person potentially contained in each proposed bounding box.
Approach (Person Box Detection)
1. Faster-RCNN system
Figure 9 Faster-RCNN main architecture [7]
1. ResNet-101 network backbone.
2. Atrous convolution to generate denser feature maps.
3. The CNN backbone was pre-trained on ImageNet.
4. The region proposal and box classifier components of the Faster-RCNN detector were trained on the COCO dataset (person category only).
5. Faster-RCNN TensorFlow implementation.
Approach (Person Pose Estimation)
• The system predicts the locations of all K = 17 person keypoints.
• Input: a person bounding box (from the first stage).
• Instead of a single regressor, use activation maps (multiple predictions of the same keypoint).
• Problem: localization precision is limited by the feature map size.
[Diagram labels: input image, localization precision, feature map size, activation map]
• Solution: a combined classification + regression approach.
• For each spatial position, first classify whether it is in the vicinity of each of the K keypoints or not (heatmap).
• Then predict a 2-D local offset vector to get a more precise estimate of the corresponding keypoint location.
Figure 10. Heatmap target for the left-elbow keypoint (left and middle). Offset field L2 magnitude (shown in grayscale) and 2-D offset vector shown in red (right).
Approach (Image Cropping)
• All boxes are first given the same fixed aspect ratio.
• The boxes are then further enlarged to include additional image context.
• The image is cropped to the resulting box and resized to a fixed crop of height 353 and width 257 pixels (aspect ratio 353/257 ≈ 1.37).
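The cropping steps can be sketched as a small box transformation. This is only an illustration under assumptions: the function name `crop_box` is hypothetical, the box is extended (never shrunk) to reach the target ratio, and the enlargement factor `scale=1.25` is an assumed value that the slides do not state.

```python
def crop_box(x0, y0, x1, y1, aspect=353 / 257, scale=1.25):
    """Extend a box to a fixed height/width ratio, then enlarge it about its center.

    Assumes a box with positive width and height. The returned box may extend
    beyond the image; padding or clipping is left to the caller.
    """
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = x1 - x0, y1 - y0
    if h / w < aspect:
        h = w * aspect   # box is too wide: extend the height
    else:
        w = h / aspect   # box is too tall: extend the width
    w, h = w * scale, h * scale  # add image context around the person
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
```

The resulting region keeps the original box center, always contains the original box, and has the same height-to-width ratio as the 353x257 crop, so resizing it does not distort the person.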
Approach (Heatmap and Offset Prediction with CNN)
• Apply a ResNet with 101 layers to the cropped image.
• For each keypoint, generate a heatmap, an x-offset, and a y-offset (3 channels per keypoint, 3K outputs in total).
• Transfer learning: start from an ImageNet-pretrained ResNet-101 model and replace its last layer with a 1x1 convolution with 3K outputs.
• Use atrous convolution to generate the 3K predictions with an output stride of 8 pixels, then bilinearly upsample them to the 353x257 crop size.
https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d
Approach (Heatmap/Offset Formula)
Generate heatmap:
• $k \in \{1, \dots, K\}$, with $K = 17$ keypoints.
• $i \in \{1, \dots, N\}$, with $N = 353 \times 257 = 90721$ positions $x_i$ in the crop.
• Ideal target: $f_k(x_i) = 1$ if keypoint $k$ is located at position $x_i$, and $f_k(x_i) = 0$ otherwise.
• Disk target: $h_k(x_i) = 1$ if $\|x_i - l_k\| \le R$, and $h_k(x_i) = 0$ otherwise, where $l_k$ is the position of keypoint $k$.
Generate 2-D offset vector:
• $F_k(x_i) = l_k - x_i$
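The two training targets can be generated with a few lines of code. The sketch below is an illustration, not the authors' implementation: the function name `keypoint_targets` and the nested-list representation are assumptions, and the radius R is left as a parameter.

```python
import math

def keypoint_targets(keypoint, height, width, radius):
    """Return (heatmap, offsets) targets for one keypoint on a height x width grid.

    heatmap[y][x] is 1 inside a disk of the given radius around the keypoint, else 0.
    offsets[y][x] is the 2-D vector pointing from (x, y) to the keypoint.
    """
    kx, ky = keypoint
    heatmap = [[0] * width for _ in range(height)]
    offsets = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = kx - x, ky - y
            offsets[y][x] = (dx, dy)           # F_k(x_i) = l_k - x_i
            if math.hypot(dx, dy) <= radius:
                heatmap[y][x] = 1              # h_k(x_i) = 1 inside the disk
    return heatmap, offsets
```

Calling `keypoint_targets((4, 4), 9, 9, 2)` marks the 13 grid positions within distance 2 of the keypoint, and every position stores the vector that leads back to (4, 4).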
Approach (Final keypoint location)
• Aggregate the heatmap and offset predictions into the final keypoint location.
• Each point j in the image crop casts a vote for the location of every keypoint.
• It is a direct application of Hough voting.
[Formula annotations: final heatmap of the k-th keypoint; offset vector from point j to the k-th keypoint; approximated heatmap of the k-th keypoint; normalization based on R]
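A simplified sketch of the voting step is shown below. It makes several assumptions that the slide text does not confirm: votes are accumulated at the nearest pixel (a stand-in for whatever interpolation kernel the original formula uses), the normalization is taken to be 1/(πR²) based on the "based on R" annotation, and the function name `hough_vote` is hypothetical.

```python
import math

def hough_vote(heatmap, offsets, radius):
    """Accumulate each position's heatmap score at the location its offset points to.

    heatmap : 2-D list of scores for one keypoint (e.g. disk probabilities)
    offsets : 2-D list of (dx, dy) vectors predicted at each position
    radius  : disk radius R, used only for the assumed 1 / (pi * R^2) normalization
    """
    height, width = len(heatmap), len(heatmap[0])
    votes = [[0.0] * width for _ in range(height)]
    norm = 1.0 / (math.pi * radius ** 2)
    for y in range(height):
        for x in range(width):
            dx, dy = offsets[y][x]
            tx, ty = int(round(x + dx)), int(round(y + dy))  # nearest-pixel vote
            if 0 <= tx < width and 0 <= ty < height:
                votes[ty][tx] += norm * heatmap[y][x]
    return votes
```

When all positions inside the disk agree on the keypoint location, their votes pile up on a single pixel, which is what turns the broad disk heatmap into a sharply localized activation map.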
Approach (Mixing the results)
Figure 11. The convolutional network predicts two targets: heatmaps and magnitude of the offset fields. Aggregating them in a weighted voting process results in highly localized activation maps.
Approach (Model Training)
• Single ResNet with two output heads (heatmap disk and offsets).
• Training images for heatmaps (explained on the board).
• Loss function: $L(\theta) = \lambda_h L_h(\theta) + \lambda_o L_o(\theta)$
• The heatmap loss $L_h(\theta)$ is the sum of logistic losses for each position and keypoint separately.
• For training the offset regression head, we penalize the difference between the predicted and ground-truth offsets. The corresponding loss $L_o(\theta)$ uses the Huber robust loss.
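The combined loss can be sketched for a flat list of positions of one keypoint. Treat this as an illustration under assumptions: the Huber threshold `delta=1.0` and the choice to count the offset loss only at positions whose heatmap target is 1 are not stated on the slide, and all function names are hypothetical. The default weights follow the values 4 and 1 mentioned on the Weaknesses slide.

```python
import math

def logistic_loss(prob, target):
    """Binary cross-entropy for one position; prob is the sigmoid output."""
    eps = 1e-12  # avoids log(0)
    return -(target * math.log(prob + eps) + (1 - target) * math.log(1 - prob + eps))

def huber(u, delta=1.0):
    """Huber robust loss: quadratic near zero, linear for large residuals."""
    a = abs(u)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def total_loss(probs, heat_targets, pred_offsets, true_offsets, lam_h=4.0, lam_o=1.0):
    """L = lam_h * L_h + lam_o * L_o over a list of positions for one keypoint."""
    l_h = sum(logistic_loss(p, t) for p, t in zip(probs, heat_targets))
    l_o = sum(
        huber(math.hypot(px - tx, py - ty))
        for (px, py), (tx, ty), t in zip(pred_offsets, true_offsets, heat_targets)
        if t == 1  # assumption: offsets are only supervised inside the disk
    )
    return lam_h * l_h + lam_o * l_o
```

The Huber term grows only linearly for large offset errors, so a few badly mispredicted positions do not dominate the gradient the way a squared error would.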
Approach (Pose Rescoring)
• At test time, we compute a refined confidence estimate for each person detected in a box.
• In particular, we maximize over locations and average over keypoints, yielding the final instance-level pose detection score.
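The rescoring rule (maximize over locations, average over keypoints) can be written directly. The function name `pose_score` and the nested-list input format are assumptions made for this sketch.

```python
def pose_score(heatmaps):
    """Average, over keypoints, of the maximum heatmap value over all locations.

    heatmaps : list with one 2-D list of scores per keypoint
    """
    peaks = [max(max(row) for row in hm) for hm in heatmaps]
    return sum(peaks) / len(peaks)
```

A box in which every keypoint has a confident peak scores close to 1, while a box where some keypoint heatmaps stay flat is pushed down, which is useful for ranking detections before non-maximum suppression.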
Approach (OKS-Based Non Max Suppression)
• Non-maximum suppression (NMS) is used to eliminate multiple detections in the person-detector stage.
• The standard approach measures overlap using the IoU of the boxes. What about keypoints?
• For two candidate pose detections, measure the overlap using OKS instead.
• A relatively high IoU-NMS threshold (0.6 in the paper's experiments) is used at the output of the person box detector to filter highly overlapping boxes.
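Both suppression steps share the same greedy procedure and differ only in the similarity measure. The sketch below is a generic illustration with hypothetical names (`iou`, `greedy_nms`); passing an OKS function over poses instead of `iou` over boxes gives an OKS-based variant.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(items, scores, similarity, threshold):
    """Keep the highest-scoring items, dropping any item too similar to a kept one.

    Returns the indices of the kept items, in order of decreasing score.
    """
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(similarity(items[i], items[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```

With the 0.6 threshold from the slide, two boxes that overlap with IoU above 0.6 collapse to the higher-scoring one, while boxes of nearby people with lower overlap both survive and are passed on to the pose estimator.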
Evaluation (Experimental Setup)
• TensorFlow, distributed training, Tesla K40 GPUs.
• For the person detector:
o 9 GPUs.
o Asynchronous SGD with momentum set to 0.9.
o The learning rate starts at 0.0003 and is decreased by a factor of 10 at 800k steps.
o Train for 1M steps.
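The detector's step schedule can be written as a one-line function. This is a sketch: the function name is hypothetical, and it assumes the drop takes effect from step 800,000 onward.

```python
def detector_learning_rate(step, base_lr=0.0003, drop_step=800_000, factor=10.0):
    """Piecewise-constant schedule: base_lr until drop_step, then base_lr / factor."""
    return base_lr if step < drop_step else base_lr / factor
```

Under this schedule the last 200k of the 1M training steps run at one tenth of the initial rate.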
• For the pose estimator:
o Two machines with 8 GPUs each.
o Batch size equal to 24 (3 crops per GPU times 8 GPUs).
o Fixed learning rate of 0.005.
o Polyak-Ruppert parameter averaging, which amounts to using a running average of the training parameters during evaluation.
o Train for 800k steps.
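Parameter averaging can be illustrated with a tiny helper that keeps a running mean of the weights seen so far. This is a sketch with a hypothetical class name; it uses a uniform running mean over a flat list of numbers, whereas practical training code often uses an exponential moving average instead.

```python
class ParameterAverager:
    """Keeps the running mean of every parameter vector passed to update()."""

    def __init__(self):
        self.count = 0
        self.average = None

    def update(self, params):
        """Fold the current parameters into the running mean and return it."""
        self.count += 1
        if self.average is None:
            self.average = list(params)
        else:
            self.average = [
                a + (p - a) / self.count for a, p in zip(self.average, params)
            ]
        return self.average
```

During training, `update` would be called after each optimizer step; at evaluation time the model is run with `average` instead of the latest weights, which smooths out the noise of the final SGD iterates.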
Evaluation (Datasets)
• All networks are pre-trained on the ImageNet classification dataset.
• To train the system, two dataset variants are used:
o One that uses only COCO data (COCO-only).
o One that augments COCO-only with an internal dataset (COCO + int).
• The COCO train + val splits contain 66K images (273K person instances).
• Of these, 62K images (105K person instances) are used for COCO-only model training, and the remaining 4,301 annotated images serve as the mini-val evaluation set.
• COCO + int = COCO-only + 73K images from Flickr.
• This in-house dataset contains an additional 227K person instances.
• The Faster-RCNN person box detection module is trained exclusively on the COCO-only dataset.
• The ResNet-based pose estimation module is trained either on the COCO-only dataset or on the augmented COCO + int dataset, and results are presented for both.
• For COCO + int pose training, mini-batches contain COCO and in-house annotation instances in a 1:1 ratio.
Evaluation (COCO Keypoints Detection State-of-the-Art)
Evaluation (Results Visualization)
Figure 12. Results visualization of a randomly picked image from the COCO test-dev set
Figure 13. Results visualization of a randomly picked image from the COCO test-dev set
Figure 14. Results visualization of a randomly picked image from the COCO test-dev set
Ablation Study (Box detection and pose estimation)
Ablation Study (OKS Non Maximum Suppression)
Weaknesses
• They have not open-sourced the system yet (even though they promised to!).
• The output of the first head passes through a sigmoid function (heatmap probabilities).
• $\lambda_h = 4$ and $\lambda_o = 1$ are scalar factors that balance the loss function terms (why these values?).
Strengths
• Multi-person.
• Cluttered environments.
• People near each other.
• High accuracy.
• Reproducibility.
Discussion
Feel free to ask any questions.
REFERENCES
[1] http://news.mit.edu/2011/mars-rover-0725
[2] www.forbes.com, static.independent.co.uk and fortune.com
[3] M. A. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, 1973.
[4] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
[5] A. Jain, J. Tompson, M. Andriluka, G. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. In ICLR, 2014.
[6] M. A. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, 1973.
[7] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
Extra Slides (IoU, OKS)
http://image-net.org/challenges/talks/2016/ECCV2016_workshop_presentation_keypoint.pdf
Extra Slides (SGDM)
http://ruder.io/optimizing-gradient-descent/
Extra Slides (What is AP, AP.5, AP.75?)
http://cocodataset.org/#detection-eval

OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 

Recently uploaded (20)

(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 

Towards Accurate Multi-person Pose Estimation in the Wild (My summary)

  • 1. HACETTEPE UNIVERSITY Presenter: A. Haje Karim Date: 31.12.2018 George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin Murphy Google, Inc. 1
  • 2. Objectives • Motivation • Problem Definition • Approach • Evaluation • Strengths and Weaknesses • Discussion • References 2
  • 3. Motivation (Why computer vision?) Figure 1 Human brain lobes (left) and evolution of eye (right) 3 https://qbi.uq.edu.au/brain/brain-anatomy/lobes-brain https://www.pinterest.com/pin/565905509398223146/
  • 4. Motivation (Why computer vision?) Figure 2. Mars rover Curiosity “When the next-generation Mars rover, dubbed Curiosity, touches down on martian soil next summer, its cameras will likely capture a scene similar to what the first explorers of the Grand Canyon witnessed: towering layers of rock and sediment rising up from a dusty valley…..”[1] 4 http://news.mit.edu/2011/mars-rover-0725
  • 5. Motivation (Why pose estimation?) Figure 3 Computer vision and its direct relation to humans [2]. SurvHuman: https://static.independent.co.uk/s3fs-public/thumbnails/image/2013/01/01/20/pg-20-cctv-getty.jpg?w968h681 autCar: http://fortune.com/2016/08/30/self-driving-drive-ai/ factRobHum: https://www.forbes.com/sites/alexknapp/2015/05/06/how-businesses-are-teaching-robots-new-tricks/#47cca6232859
  • 6. Motivation (Why this paper?) Why is this paper novel and critical? • No prior knowledge of the location and scale • Multi-person and crowded scenes
  • 7. Problem Definition (What is the problem?) We would like to: 1. Detect many people. 2. Estimate their poses. Figure 4 An image taken from COCO 2018 Keypoint Detection Task
  • 8. Problem Definition (Clearly) Estimate the 2-D locations of human joints on the arms and legs, and of keypoints on the torso and the face. The method should work when there are many people in the scene. The method should be able to work in a cluttered environment.
  • 9. Previous Work • A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014 • They mainly used a deep CNN • As you can see, it does not work for multiple people! Figure 5 Pose estimation results on images from LSP [4]
  • 10. Previous Work • A. Jain, J. Tompson, M. Andriluka, G. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. In ICLR, 2014 • For every feature, compute a score based on whether it is present in the region or not Figure 6 CNN structure used in this work [5]
  • 11. Previous Work 1. The network performs end-to-end feature learning and is trained with the back-propagation algorithm. 2. It starts with a 64×64 pixel RGB input patch which has been local contrast normalized (LCN) to emphasize geometric discontinuities and improve generalization performance. 3. The input is then processed by three convolution and subsampling layers, which use rectified linear units (ReLUs) and max-pooling. 4. Following the three stages of convolution and subsampling, the top-level pooled map is flattened to a vector and processed by three fully connected layers, analogous to those used in deep neural networks. 5. The output layer has a single logistic unit, representing the probability of the body part being present in that patch. 6. To train the convnet, we performed standard batch stochastic gradient descent. 7. From the training set images, we set aside a validation set to tune the network hyper-parameters, such as number and size of features, learning rate, etc.
  • 12. Previous Work M. A. Fischler and R. Elschlager. The representation and matching of pictorial structures. In IEEE TOC, 1973. Quite an old paper, but it cannot be skipped! Research in human pose estimation has been based on the idea of part-based models, as pioneered by the Pictorial Structures (PS) model. The majority of these methods focus on capturing rich dependencies among body parts and properties.
  • 13. Previous Work The representation and matching of pictorial structures. Figure 7 Reference description of a face and schematic representation of face reference, indicating components and their linkages [6]
  • 14. Approach (Some Theoretical Background) • Definition: Human Pose Estimation is defined as 2-D localization of human joints on the arms, legs, and keypoints on torso and the face. • Definition: The degree of match between ground truth and predicted poses is measured in terms of object keypoint similarity (OKS), which ranges from 0 to 1. • Definition: OKS-induced average precision (AP) metric is used to measure the overall quality of the combined person detection and pose estimation. 14
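The OKS definition above can be made concrete with a small sketch. It assumes the usual COCO-style form, in which each labeled keypoint contributes exp(-d² / (2·s²·κ²)) and the contributions are averaged; the function name `oks`, its argument layout and the `kappa` values passed in are hypothetical choices for illustration, not taken from the slides.

```python
import math

def oks(pred, gt, visible, area, kappa):
    """Object keypoint similarity between a predicted and a ground-truth pose.

    pred, gt : lists of (x, y) keypoint coordinates, same length K.
    visible  : list of booleans, True where the ground-truth keypoint is labeled.
    area     : object area in pixels (the squared object scale s**2).
    kappa    : per-keypoint falloff constants (hypothetical values in this sketch).
    """
    sims = []
    for (px, py), (gx, gy), v, k in zip(pred, gt, visible, kappa):
        if not v:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        sims.append(math.exp(-d2 / (2.0 * area * k ** 2)))
    # Average over labeled keypoints; 1.0 is an exact match, values near 0 are no match.
    return sum(sims) / len(sims) if sims else 0.0
```

Identical poses give a value of 1, and the value decays toward 0 as the keypoints drift apart relative to the object scale.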
  • 15. • To tackle multi-person pose estimation, we have two approaches: 1. Bottom-up: in which keypoint proposals are grouped together into person instances. 2. Top-down: in which a pose estimator is applied to the output of a bounding-box person detector. 15 Approach (Some Theoretical Background)
  • 16. Approach (Block Diagram) Person Box Detection (proposal) → Person Pose Estimation (refinement)
  • 17. Approach (Two stage model overview) 17 Figure 8 The overall system
  • 18. Approach (How to solve the problem?) It consists of two stages: 1. They predict the location and scale of boxes which are likely to contain people. 2. They estimate the keypoints of the person potentially contained in each proposed bounding box. 18
  • 19. Approach (Person Box Detection) 1. Faster-RCNN system 19 Figure 9 Faster-RCNN main architecture [7]
  • 20. Approach (Person Box Detection) 1. ResNet-101 network backbone. 2. Atrous convolution to generate denser feature maps. 3. CNN backbone pre-trained on ImageNet. 4. Region proposal and box classifier components of the Faster-RCNN detector trained on the COCO dataset (person category only). 5. Faster-RCNN TensorFlow implementation.
  • 21. Approach (Person Pose Estimation) • The system predicts the locations of all K = 17 person keypoints • Input: person bounding box (from the first stage) • Do not use a single regressor per keypoint; use activation maps instead (they allow multiple predictions of the same keypoint) • Problem with localization precision (Figure: input image, feature map size, activation map, localization precision)
  • 22. Approach (Person Pose Estimation) 22 • Solution : classification + regression approach. • For each spatial position, we first classify whether it is in the vicinity of each of the K keypoints or not (heatmap) • Predict a 2-D local offset vector to get a more precise estimate of the corresponding keypoint location.
  • 23. Approach (Person Pose Estimation) 23 Figure 10. Heatmap target for the left-elbow keypoint (left and middle). Offset field L2 magnitude (shown in grayscale) and 2-D offset vector shown in red(right).
  • 24. Approach (Image Cropping) • Give all boxes the same fixed aspect ratio. • Further enlarge the boxes to include additional image context. • Crop the image to the resulting box and resize the crop to a fixed height of 353 and width of 257 pixels (aspect ratio 353/257 ≈ 1.37).
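As a rough illustration of the cropping step, the sketch below extends a detector box to the 353/257 height-to-width ratio and then enlarges it for context. The function name `crop_box` and the (x0, y0, x1, y1) box layout are assumptions; the default rescaling factor of 1.25 is the evaluation-time value given in the editor's notes. Clipping or padding at the image border is not handled here.

```python
def crop_box(x0, y0, x1, y1, aspect=353 / 257, rescale=1.25):
    """Extend a box to a fixed height/width ratio, then enlarge it.

    Only one side is extended, so the image content itself is never distorted.
    Returns the new (x0, y0, x1, y1), which may extend past the image border.
    """
    w, h = x1 - x0, y1 - y0
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    if h / w < aspect:
        h = w * aspect   # box too wide: extend the height
    else:
        w = h / aspect   # box too tall: extend the width
    w, h = w * rescale, h * rescale  # include additional image context
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
```

The resulting region would then be resized to 353x257 pixels with whatever image library is at hand.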
  • 25. Approach (Heatmap and Offset Prediction with CNN) • Apply a ResNet with 101 layers to the cropped image. • For each keypoint, generate a heatmap, an x-offset and a y-offset. • Transfer learning: use an ImageNet-pretrained ResNet-101 model and replace its last layer with a 1x1 convolution with 3·K outputs. • Use atrous convolution to generate the 3·K predictions with an output stride of 8 pixels and bilinearly up-sample them to the 353x257 crop size. https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d
  • 26. Approach (Heatmap/Offset Formula) Generate heatmap: • $k \in \{1, \dots, K\}$, $K = 17$ • $i \in \{1, \dots, N\}$, $N = 353 \times 257 = 90721$ • Ideal target: $f_k(x_i) = 1$ if keypoint $k$ is located at position $x_i$, else $f_k(x_i) = 0$ • Approximation: $h_k(x_i) = 1$ if $\|x_i - l_k\| \le R$, else $h_k(x_i) = 0$ Generate 2-D offset vector: • $F_k(x_i) = l_k - x_i$, where $l_k$ is the position of keypoint $k$
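The two training targets for a single keypoint can be sketched as follows: a binary disk heatmap (1 within radius R of the keypoint, 0 elsewhere) and an offset field pointing from each position to the keypoint. This is a minimal pure-Python sketch; the function name `keypoint_targets` and the nested-list layout are illustrative assumptions.

```python
import math

def keypoint_targets(height, width, keypoint, radius):
    """Build the binary heatmap and the offset field for one keypoint.

    keypoint is the ground-truth location (x, y); positions are integer pixels.
    Returns (heatmap, offsets), both indexed as [row][col].
    """
    lx, ly = keypoint
    heatmap = [[0] * width for _ in range(height)]
    offsets = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = lx - x, ly - y            # offset = keypoint - position
            offsets[y][x] = (dx, dy)
            if math.hypot(dx, dy) <= radius:   # inside the disk of radius R
                heatmap[y][x] = 1
    return heatmap, offsets
```

On the real 353x257 crop an array library would be the natural choice; plain lists are used here only to keep the sketch dependency-free.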
  • 27. Approach (Final keypoint location) • Aggregate the results into the final location. • Each point j in the image crop casts its suggestions for the locations of all the keypoints. • It is a direct application of Hough voting. (Formula annotations: final heatmap of the k-th keypoint; offset vector from point j to the k-th keypoint; approximated heatmap of the k-th keypoint; normalization based on R)
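A simplified sketch of the voting step is shown below. Each position adds its heatmap probability, divided by the disk area πR², to the pixel it points at. As a deliberate simplification, the vote is rounded to the nearest pixel instead of being spread with a bilinear interpolation kernel; the function name `hough_vote` and the data layout are assumptions.

```python
import math

def hough_vote(prob, offsets, radius):
    """Aggregate heatmap probabilities and offsets into one activation map.

    prob[y][x]    : probability that (x, y) lies in the keypoint's disk.
    offsets[y][x] : predicted (dx, dy) from (x, y) to the keypoint.
    Votes are rounded to the nearest pixel (a simplification of the
    bilinear kernel), and votes falling outside the crop are dropped.
    """
    height, width = len(prob), len(prob[0])
    norm = math.pi * radius ** 2               # area of the disk
    votes = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = offsets[y][x]
            tx, ty = round(x + dx), round(y + dy)
            if 0 <= tx < width and 0 <= ty < height:
                votes[ty][tx] += prob[y][x] / norm
    return votes
```

With perfect heatmaps and offsets, all the mass lands on the keypoint pixel and sums to roughly 1, up to the discretization of the disk.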
  • 28. Approach (Mixing the results) 28 Figure 11. The convolutional network predicts two targets: heatmaps and magnitude of the offset fields. Aggregating them in a weighted voting process results in highly localized activation maps.
  • 29. Approach (Model Training) • Single ResNet with two output heads (heatmap disks and offsets). • Training images for heatmaps (on board) • Loss function: $L(\theta) = \lambda_h L_h(\theta) + \lambda_o L_o(\theta)$ • The heatmap loss $L_h(\theta)$ is the sum of logistic losses for each position and keypoint separately. • For training the offset regression head, we penalize the difference between the predicted and ground truth offsets; the corresponding loss $L_o(\theta)$ uses the Huber robust loss.
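The combined loss can be sketched for a single keypoint as below: a logistic loss at every position for the heatmap head, plus a Huber loss on the offset error. The sketch assumes the offset term is only evaluated inside the keypoint's disk, and the Huber threshold `delta=1.0` is a hypothetical value. The weights 4 and 1 are the ones quoted on the Weaknesses slide; the function names are illustrative.

```python
import math

def huber(u, delta=1.0):
    """Huber robust loss: quadratic near zero, linear for large residuals."""
    return 0.5 * u * u if u <= delta else delta * (u - 0.5 * delta)

def pose_loss(logits, pred_offsets, keypoint, radius, lam_h=4.0, lam_o=1.0):
    """Weighted sum of heatmap and offset losses for one keypoint.

    logits[y][x]       : raw heatmap output before the sigmoid.
    pred_offsets[y][x] : predicted (dx, dy) offset at (x, y).
    keypoint           : ground-truth location (x, y).
    """
    lx, ly = keypoint
    loss_h = loss_o = 0.0
    for y, row in enumerate(logits):
        for x, z in enumerate(row):
            gx, gy = lx - x, ly - y                   # ground-truth offset
            inside = math.hypot(gx, gy) <= radius     # heatmap target (True/False)
            # Logistic loss on the logit, in a numerically stable form.
            loss_h += max(z, 0.0) - z * inside + math.log1p(math.exp(-abs(z)))
            if inside:                                # assumption: offsets matter only in the disk
                px, py = pred_offsets[y][x]
                loss_o += huber(math.hypot(px - gx, py - gy))
    return lam_h * loss_h + lam_o * loss_o
```

In the full system this would be summed over all K keypoints and averaged over a mini-batch.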
  • 30. Approach (Pose Rescoring) • At test time, we compute a refined confidence estimate for the person detected in each box. • In particular, we maximize over locations and average over keypoints, yielding our final instance-level pose detection score.
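The rescoring rule, "maximize over locations and average over keypoints", amounts to score = (1/K) Σ_k max_i f_k(x_i). A minimal sketch, assuming the K activation maps are given as nested lists (the function name `pose_score` is hypothetical):

```python
def pose_score(activation_maps):
    """Instance-level score: max over positions, then mean over keypoints.

    activation_maps is a list of K 2-D maps, each indexed as [row][col].
    """
    peaks = [max(max(row) for row in f_k) for f_k in activation_maps]
    return sum(peaks) / len(peaks)
```

For example, two keypoint maps whose peaks are 0.9 and 0.5 give a pose score of 0.7.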
  • 31. Approach (OKS-Based Non Max Suppression) • Non-maximum suppression (NMS) eliminates multiple detections in the person-detector stage. • The standard approach measures the IoU of the boxes. What about keypoints? • Measure the overlap of two candidate pose detections using OKS. • A high IoU-NMS threshold (0.6 in our experiments) is used at the output of the person box detector to filter highly overlapping boxes.
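A greedy version of pose-level NMS can be sketched as follows. The similarity function is passed in, so an OKS implementation can be plugged in; the function name `oks_nms` and the default `threshold=0.5` are hypothetical and not taken from the slides.

```python
def oks_nms(poses, scores, similarity, threshold=0.5):
    """Greedy non-maximum suppression over pose candidates.

    poses      : list of candidate poses (any representation).
    scores     : list of detection scores, one per pose.
    similarity : function (pose_a, pose_b) -> value in [0, 1], e.g. OKS.
    threshold  : hypothetical cutoff; a candidate more similar than this
                 to an already kept pose is suppressed.
    Returns the indices of the kept poses, highest score first.
    """
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(similarity(poses[i], poses[j]) <= threshold for j in kept):
            kept.append(i)
    return kept
```

The structure is the same as box-level NMS; only the overlap measure changes from box IoU to a keypoint-based similarity.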
  • 32. Evaluation (Experimental Setup) • TensorFlow, distributed training, Tesla K40 GPUs. • For person detector: o 9 GPUs. o Asynchronous SGD with momentum set to 0.9. o The learning rate starts at 0.0003 and is decreased by a factor of 10 at 800k steps. o Train for 1M steps.
  • 33. Evaluation (Experimental Setup) • For pose estimator: o Two machines with 8 GPUs each. o Batch size equal to 24 (3 crops per GPU times 8 GPUs). o Fixed learning rate of 0.005. o Polyak-Ruppert parameter averaging, which amounts to using a running average of the training parameters during evaluation. o Train for 800k steps.
  • 34. Evaluation (Datasets) • All networks are pre-trained on the ImageNet classification dataset. • To train our system we use two dataset variants: o One that uses only COCO data (COCO-only), o One that augments COCO-only with an internal dataset (COCO + int). • From the 66K images (273K person instances) in the COCO train + val splits, 62K images (105K person instances) are used for COCO-only model training, and the remaining 4,301 annotated images serve as a mini-val evaluation set. • Our COCO + int = COCO-only + 73K images from Flickr. • This in-house dataset contains an additional 227K person instances.
  • 35. Evaluation (Datasets) • The Faster-RCNN person box detection module is trained exclusively on the COCO-only dataset. • The ResNet-based pose estimation module is trained either on the COCO-only or on the augmented COCO + int dataset, and results are presented for both. • For COCO + int pose training we use mini-batches that contain COCO and in-house annotation instances in 1:1 ratio.
  • 36. Evaluation (COCO Keypoints Detection State-of-the-Art)
  • 37. 37 Evaluation (Results Visualization) Figure 12. Results visualization of randomly picked image from COCO test-dev set
  • 38. 38 Evaluation (Results Visualization) Figure 13. Results visualization of randomly picked image from COCO test-dev set
  • 39. 39 Evaluation (Results Visualization) Figure 14. Results visualization of randomly picked image from COCO test-dev set
  • 40. Ablation Study (Box detection and pose estimation) 40
  • 41. 41 Ablation Study (OKS Non Maximum Suppression)
  • 42. Weaknesses • They have not open-sourced the system yet (even though they promised to!) • The output of the first head passes through a sigmoid function (heatmap probabilities) • $\lambda_h = 4$ and $\lambda_o = 1$ are scalar factors that balance the loss function terms (why these values?)
  • 43. Strengths • Multi-person • Cluttered environments • People near each other • High accuracy • Reproducibility
  • 44. Discussion Feel free to ask any question 44
  • 45. 45
  • 46. REFERENCES [1] http://news.mit.edu/2011/mars-rover-0725 [2] www.forbes.com, static.independent.co.uk and fortune.com [3] M. A. Fischler and R. Elschlager. The representation and matching of pictorial structures. In IEEE TOC, 1973. [4] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014. [5] A. Jain, J. Tompson, M. Andriluka, G. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. In ICLR, 2014. [6] M. A. Fischler and R. Elschlager. The representation and matching of pictorial structures. In IEEE TOC, 1973.
  • 47. REFERENCES [7] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time object detection with region proposal networks. In NIPS, 2015. 47
  • 48. Extra Slides (IoU, OKS) http://image-net.org/challenges/talks/2016/ECCV2016_workshop_presentation_keypoint.pdf
  • 50. What is AP, AP.5, AP.75? http://cocodataset.org/#detection-eval

Editor's Notes

  1. My name is Abdulrahman Haje Karim, an M.Sc. student in the Computer Engineering department. My main interest is the intersection of the CV, CG and deep learning fields. I will be presenting a brilliant work done by a Google research group, titled “Towards Accurate Multi-person Pose Estimation in the Wild”.
  2. First of all, let us start with an introduction covering my own and the authors' motivation; then let us examine the problem at hand. After that, let us study in detail how they solved the problem. Then let us discuss the results they achieved, and let me conclude with my point of view, including the strengths and weaknesses of this work. Finally, let us open a slot for your questions, suggestions and points of view regarding this research paper.
  3. Well, computer vision goes back to around 500 million years ago, when the first creatures on earth started to develop visual systems to better hide from enemies and search for food (refer to the introduction to deep learning by Fei-Fei Li, Stanford University). What is more interesting is that “half of the human brain is devoted directly or indirectly to vision”, says Professor Mriganka Sur of the MIT Department of Brain and Cognitive Sciences.
  4. Maybe you were surprised to see that strange photo from Mars while talking about our paper. You are right, but let me share with you an article that talks about the exploration of Mars. You can see from the first three lines that the scientists' main concern is that we will be able to get a visual representation of that new planet, which will lead us to more knowledge, assumptions and predictions.
  5. Obviously, a lot of real-world problems have started to find a solution with the help of computer vision. However, humans are present in most of these problems, which forces computer vision scientists to care about them the most! Human detection and pose estimation play a central role in almost all computer vision applications. Some examples are human-robot interaction (HRI), video surveillance and self-driving cars. Now we understand the importance of computer vision and its direct relation with humans.
  6. Why select this paper? It is a hot topic in both virtual and augmented reality. Many papers try to solve the pose estimation problem under the assumption that the location and/or scale of the person instances is given as ground truth, which simplifies the problem and limits the generality of the solution. Many papers have also tried to solve the problem for a single person, or for scenes where the people are far from each other. However, this paper tackles the problem when people are close to each other, which makes it quite difficult to solve the association problem of determining which body part belongs to which person.
  7. We would like to detect many people and to perform pose estimation on them.
  8. We would like to detect many people and to perform pose estimation on them.(again)
  9. Most of the previous work, as mentioned before, focused on a single person in the scene, so the algorithms fail in situations where many people are present. LSP: The Leeds Sports Pose dataset contains 2000 pose-annotated images of mostly sports people gathered from Flickr. The images have been scaled such that the most prominent person is roughly 150 pixels in length. Each image has been annotated with 14 joint locations. Left and right joints are consistently labelled from a person-centric viewpoint.
  10. Most of the previous work, as mentioned before, focused on a single person in the scene, so the algorithms fail in situations where many people are present.
  11. 1. Each part represents local visual properties. 2. “Springs” capture spatial relationships. 3. Matching model to image involves joint optimization of part locations “stretch and fit”
  12. OKS can be thought of as the keypoint analogue of Intersection over Union (IoU): 0 means no match and 1 means an exact match.
  13. Bottom-up: Detect body parts instead of full persons, then subsequently associate these parts to human instances, thus performing pose estimation in a bottom-up fashion. Such approaches employ part detectors and differ in how associations among parts are expressed, and in the inference procedure used to obtain full part groupings into person instances. Top-down: First perform person detection, then perform pose estimation. The second approach is the one used in this paper.
  14. First Stage: we employ a Faster-RCNN person detector to produce a bounding box around each candidate person instance (proposal). Second Stage: we apply a pose estimator to the image crop extracted around each candidate person instance in order to localize its keypoints and re-score the corresponding proposal (a refinement): (i) go beyond bounding boxes and predict keypoints, and (ii) rescore the detection based on the estimated keypoints. Only person box detection proposals with a score higher than 0.3 are fed to the second stage, resulting in only 3.5 proposals per image on average.
  15. We slide the RPN (Region Proposal Network) over the feature map and generate k proposals (with different sizes/aspect ratios), each with a score. During training, an anchor box is accepted as a positive if its IoU with a ground-truth box is greater than 0.7 or is the maximum for that box; it is rejected as a negative if its IoU is less than 0.3. Each sliding window is mapped to a lower-dimensional vector (256-d for ZF and 512-d for VGG). This vector is fed into two sibling fully-connected layers: a box-regression layer (reg) and a box-classification layer (cls).
  16. Atrous convolution is used to generate denser feature maps, with an output stride of 8 pixels instead of the default 32 pixels. The detector is trained on the person category only; the box annotations for the remaining 79 COCO categories have been ignored.
  17. Solution to multi-person: one approach would be to use a single regressor per keypoint, as in [45], but this is problematic when there is more than one person in the image patch (in which case a keypoint can occur in multiple places). A logical solution is to predict activation maps, as in [27], which allow for multiple predictions of the same keypoint. However, the size of the activation maps, and thus the localization precision, is limited by the size of the net’s output feature maps, which is a fraction of the input image size.
  18. We use the accurate ResNet-101 (353x257) pose estimator with disk radius R = 25 pixels in the rest of the paper.
  19. Make all boxes returned by the person detector have the same fixed aspect ratio by extending either the height or the width, without distorting the image aspect ratio. We use a rescaling factor equal to 1.25 during evaluation and a random rescaling factor between 1.0 and 1.5 during training (for data augmentation). Aspect ratio = 353/257 ≈ 1.37
  20. The network is applied in a fully convolutional fashion to produce heatmaps (one channel per keypoint) and offsets (two channels per keypoint, for the x- and y-directions), for a total of 3·K output channels, where K = 17 is the number of keypoints. Atrous convolution is used to generate a denser feature map.
  21. k is the index of the keypoint. The ideal target is very hard to train on directly, so let us approximate it.
  22. This is a form of Hough voting: each point j in the image crop grid casts a vote with its estimate for the position of every keypoint. Each vote is weighted by the probability that it is in the disk of influence of the corresponding keypoint. The normalizing factor equals the area of the disk and ensures that if the heatmaps and offsets were perfect, then $f_k(x_i)$ would be a unit-mass delta function centered at the position of the k-th keypoint. $G(\cdot)$ is the bilinear interpolation kernel.
  23. Classifier and regressor: a classifier for the heatmaps $h_k(x_i)$ and a regressor for the offsets. The training target for heatmaps is a map of zeros and ones: ones in the vicinity of the keypoint and zeros otherwise. $\lambda_h = 4$ and $\lambda_o = 1$ are scalar factors that balance the loss function terms. Logistic loss is like hinge loss but logarithmic. $F_k$ is the predicted offset and $l_k - x_i$ is the ground truth offset. When computing the heatmap loss, we treat as positives only the disks around the keypoints of the foreground person and as negatives everything else, forcing the model to predict correctly the keypoints of the person in the center of the box.
  24. At test time, rather than just relying on the confidence from the person detector, we compute a refined confidence estimate, which takes into account the confidence of each keypoint. Average the confidences of all estimated keypoints.
  25. We have implemented our system in TensorFlow. On momentum, from http://ruder.io/optimizing-gradient-descent/: Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. gamma < 1). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
  26. Table 1 shows the COCO keypoint test-dev split performance of our system trained on COCO-only or trained on COCO+int datasets. Table 2 shows the COCO keypoint test-standard split results of our model with the pose estimator trained on either COCO-only or COCO+int training set.
  27. For simplicity and to facilitate reproducibility we do not utilize multi-scale evaluation or model ensembling in the Faster-RCNN person box detection stage. Using such enhancements can further improve our results at the cost of significantly increased computation time.
  28. Gamma is the momentum. Mu is the learning rate. Theta is the parameter vector. J(theta) is the objective function. SGD performs a parameter update for each training example.
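The momentum update described in these notes can be sketched in a few lines. This assumes the classical form, velocity ← gamma · velocity + lr · gradient followed by theta ← theta − velocity; the function name `momentum_step` is hypothetical, and the defaults are the person-detector settings from slide 32 (learning rate 0.0003, momentum 0.9).

```python
def momentum_step(theta, velocity, grad, lr=0.0003, gamma=0.9):
    """One SGD-with-momentum update on a list of parameters.

    velocity <- gamma * velocity + lr * grad
    theta    <- theta - velocity
    Returns the updated (theta, velocity) as new lists.
    """
    new_velocity = [gamma * v + lr * g for v, g in zip(velocity, grad)]
    new_theta = [t - v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity
```

As a usage example, minimizing J(theta) = theta² means calling `momentum_step(theta, velocity, [2 * theta[0]])` repeatedly, which drives theta toward 0.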