PR-185: RetinaFace: Single-stage Dense Face Localisation in the Wild
1. PR:185
RetinaFace: Single-stage Dense Face
Localisation in the Wild
visionNoob
Deng, Jiankang, et al. "RetinaFace: Single-stage Dense Face Localisation in the Wild." arXiv preprint arXiv:1905.00641 (2019).
(Submitted on 2 May 2019 (v1), last revised 4 May 2019 (this version, v2))
2. Face Detection
state-of-the-art face detection
Definition : face localization
Broader definition : face localization + landmark detection + pixel-wise face parsing + 3d reconstruction
3. Encoder
0. Face Recognition
Naïve Example : Face Verification
Preprocessing -> Encoder -> L2 norm -> unit vector
Compare the two embeddings with a similarity score in [0, 1]:
if (similarity >= threshold):
    same person!
else:
    not the same person!
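The verification flow on this slide can be sketched in a few lines (a minimal sketch: the embedding values and the 0.5 threshold are placeholder assumptions, not values from the slide):

```python
import numpy as np

def l2_normalise(v):
    """Map an embedding to a unit vector (the 'L2 norm' step on the slide)."""
    return v / np.linalg.norm(v)

def verify(emb_a, emb_b, threshold=0.5):
    """Cosine similarity of two unit-vector embeddings.

    For unit vectors the dot product is the cosine similarity
    (1.0 = identical direction); declare 'same person' above a threshold.
    """
    similarity = float(np.dot(emb_a, emb_b))
    return similarity >= threshold
```

In a real pipeline the embeddings would come from a face encoder such as ArcFace applied to an aligned face crop.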
8. 1. Introduction
1.3 Main Contributions
1. Based on a single-stage design, we propose a novel pixel-wise face localisation
method named RetinaFace, which employs a multi-task learning strategy to
simultaneously predict the face score, face box, five facial landmarks, and the 3D position and
correspondence of each facial pixel.
2. On the WIDER FACE hard subset, RetinaFace outperforms the AP of the state-of-the-art
two-stage method.
3. On the IJB-C dataset, RetinaFace helps to improve ArcFace's verification accuracy.
4. By employing light-weight backbone networks, RetinaFace can run real-time on a
single CPU core for a VGA-resolution image.
5. Extra annotations and code have been released to facilitate future research.
9. WIDER Face & Person Challenge 2019
Track 1: Face Detection Track 2: Pedestrian Detection
Track 3: Cast Search by Portrait Track 4: Person Search by Language
http://wider-challenge.org/2019.html
10. 2. Related Work
2.1 Image Pyramid vs Feature Pyramid
2. Related Work
2.1. Image pyramid v.s. feature pyramid
2.2. Two-stage v.s. single-stage
2.3. Context Modelling
2.4. Multi-task Learning
Hao, Zekun, et al. "Scale-aware face detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
Image Pyramid vs. Feature Pyramid
11. 2. Related Work
2.2 Two-stage v.s. single-stage
12. 2. Related Work
2.3 Context Modeling
Context Module
To enhance the model's contextual reasoning power.
13. 2. Related Work
2.3 Context Modeling
Deformable Convolutional Network
J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017. 2,
X. Zhu, H. Hu, S. Lin, and J. Dai. Deformable convnets v2: More deformable, better results. arXiv:1811.11168, 2018.
14. 2. Related Work
2.4 Multi-task Learning
He, Kaiming, et al. "Mask r-cnn." Proceedings of the IEEE international conference on computer vision. 2017.
Mask R-CNN
Multi-task learning
15. 3. RetinaFace
3.1. Multi-task Loss
3. RetinaFace
3.1. Multi-task loss
3.2. Dense Regression Branch
Multi-task learning
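The multi-task loss combines the four prediction tasks as a weighted sum; a sketch of the per-anchor combination (the weights λ1 = 0.25, λ2 = 0.1, λ3 = 0.01 are the ones reported in the paper; the individual loss terms are treated as opaque scalars here):

```python
def multitask_loss(l_cls, l_box, l_pts, l_pixel, is_positive,
                   lam1=0.25, lam2=0.1, lam3=0.01):
    """RetinaFace-style per-anchor multi-task loss.

    l_cls:   face/not-face classification loss (applies to all anchors)
    l_box:   face-box regression loss
    l_pts:   five-facial-landmark regression loss
    l_pixel: dense (3D) regression loss
    is_positive: 1 for positive (face) anchors, 0 otherwise; the three
    regression terms only contribute for positive anchors.
    """
    return l_cls + is_positive * (lam1 * l_box + lam2 * l_pts + lam3 * l_pixel)
```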
16. 3. RetinaFace
3.2. Dense Regression Branch
Zhou, Yuxiang, et al. "Dense 3D Face Decoding over 2500FPS: Joint Texture & Shape Convolutional Mesh Decoders." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
17. 4. Experiments
4.1 Dataset
WIDER face (hard)
- 32,203 images, 393,703 face bboxes
(with a high degree of variability in scale, pose, expression, occlusion and illumination)
18. 4. Experiments
4.1 Dataset
[Figure: WIDER FACE example images for the event categories "car accident", "couple", "concert"]
4.1. Dataset
4.2. Implementation details
4.3. Ablation Study
4.4. Face box Accuracy
4.5. Five Facial Landmark Accuracy
4.6. Dense Facial Landmark Accuracy
4.7. Face Recognition Accuracy
4.8. Inference Efficiency
19. 4. Experiments
4.1 Dataset
Extra Annotation
- Facial landmarks (eye centres, nose tip and mouth corners)
- 84.6k faces on the training set and 18.5k faces on the validation set.
20. 4. Experiments
4.2 Implementation details
1. Feature pyramid
2. Context module
3. Anchor setting
4. Data augmentation
5. Training detail
6. Testing detail
Head output channels: # of anchors * (2 + 4 + 10 + 128 + 7 + 9)
(2: face score, 4: box offsets, 10: five landmarks, 128 + 7 + 9: dense 3D regression parameters)
Conv -> DCN
21. 4. Experiments
4.2 Implementation details
Anchor setting
- Scale step at 2^(1/3) and the aspect ratio at 1:1
- With the input image size at 640 × 640, the anchors can cover
scales from 16 × 16 to 406 × 406 on the feature pyramid levels.
In total, there are 102,300 anchors, and 75% of these anchors are
from P2.
- OHEM
- 1:3 (pos : neg)
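The anchor numbers above can be reproduced from the pyramid geometry (assuming five pyramid levels P2..P6 with strides 4 to 64 and three anchor scales per location, consistent with the slide):

```python
# Feature pyramid levels P2..P6 with strides 4, 8, 16, 32, 64;
# three scales per location (scale step 2^(1/3)), aspect ratio 1:1.
strides = [4, 8, 16, 32, 64]
scales_per_level = 3
input_size = 640

# Anchors per level = scales * (feature map height * width)
counts = [scales_per_level * (input_size // s) ** 2 for s in strides]
total = sum(counts)             # 102,300 anchors in total
p2_fraction = counts[0] / total # ~75% of all anchors come from P2
```

The dominance of P2 is why the paper pairs this dense tiling with OHEM and a 1:3 positive:negative ratio: almost all of those P2 anchors are easy negatives.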
22. 4. Experiments
4.2 Implementation details
Data augmentation
- Random crop
- Horizontal flip
- Photo-metric color distortion
Training Details
- SGD (momentum at 0.9, weight decay at 0.0005, batch size of 8 × 4)
- on four NVIDIA Tesla P40 (24GB) GPUs.
- The learning rate starts from 10^-3, rising to 10^-2 after 5 epochs,
then divided by 10 at 55 and 68 epochs,
- terminating at 80 epochs.
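As a sanity check, the schedule above can be written as a step function of the epoch (a sketch; the slide does not specify the exact warm-up shape, so the rise to 10^-2 is modelled as a single step at epoch 5):

```python
def learning_rate(epoch):
    """Learning-rate schedule from the training details above:
    start at 1e-3, rise to 1e-2 after 5 epochs, divide by 10
    at epochs 55 and 68; training terminates at epoch 80."""
    if epoch < 5:
        return 1e-3
    if epoch < 55:
        return 1e-2
    if epoch < 68:
        return 1e-3
    return 1e-4
```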
Testing Details
- flip as well as multi-scale (the short edge of image at [500, 800, 1100, 1400, 1700]) strategies.
- Box voting with the IoU threshold at 0.4 (plain NMS also works)
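Box voting replaces each cluster of overlapping detections with a score-weighted average box instead of keeping only the top-scoring one; a minimal sketch (the greedy grouping here is a simplification, and plain NMS is the stated alternative):

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def box_vote(boxes, scores, thresh=0.4):
    """Merge detections that overlap (IoU >= thresh) into one box
    whose coordinates are the score-weighted average of the group."""
    order = np.argsort(scores)[::-1]  # highest score first
    kept, used = [], set()
    for i in order:
        if i in used:
            continue
        group = [j for j in order
                 if j not in used and iou(boxes[i], boxes[j]) >= thresh]
        used.update(group)
        w = scores[group]
        kept.append((boxes[group] * w[:, None]).sum(0) / w.sum())
    return np.array(kept)
```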
23. 4. Experiments - Ablation Study
[Overview figure]
WIDER FACE dataset (easy, medium, hard):
- Face detection: SOTA (AP 91.4% on the hard subset)
- Joint prediction: face box, five facial landmarks, dense 3D face reconstruction
- Lightweight backbone (MobileNet) -> real-time inference
IJB-C dataset:
- ArcFace (with RetinaFace): better verification accuracy from the extra supervision
24. 4. Experiments
4.3. Ablation Study
AP metrics: IoU = 0.5:0.05:0.95 (averaged over thresholds, COCO-style) vs. IoU = 0.5
25. 4. Experiments
4.3. Ablation Study
He, Kaiming, et al. "Mask r-cnn." Proceedings of the IEEE international conference on computer vision. 2017.
From Mask r-cnn
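The two evaluation settings differ only in the IoU thresholds that are averaged: AP@0.5 uses a single threshold, while the stricter COCO-style metric averages AP over ten thresholds. A small sketch:

```python
import numpy as np

# AP@0.5 uses one IoU threshold; the "0.5:0.05:0.95" notation means
# ten thresholds from 0.5 to 0.95 in steps of 0.05, whose APs are averaged.
thresholds = np.arange(0.5, 1.0, 0.05)

def coco_style_ap(ap_at):
    """Average the per-threshold APs; ap_at maps an IoU threshold to its AP."""
    return float(np.mean([ap_at(t) for t in thresholds]))
```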
26. 4. Experiments : Face Box Accuracy
27. 4. Experiments
4.4. Face box Accuracy (WIDER face)
28. 4. Experiments : Five Facial Landmarks Accuracy
29. 4. Experiments
4.5. Five Facial Landmark Accuracy
Metrics: cumulative error distribution (CED) and normalised mean error (NME)
https://pdfs.semanticscholar.org/b4d2/151e29fb12dbe5d164b430273de65103d39b.pdf
Failure rate (NME threshold at 10%) drops from 26.31% to 9.37%.
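NME and the failure rate can be sketched as follows (the choice of normalisation distance, e.g. face-box size vs. inter-ocular distance, depends on the benchmark's convention and is an assumption here):

```python
import numpy as np

def nme(pred, gt, norm_dist):
    """Normalised mean error for one face.

    pred, gt: (N, 2) arrays of predicted / ground-truth landmark coordinates.
    norm_dist: normalisation distance (e.g. face-box size or inter-ocular
    distance, depending on the benchmark's convention).
    """
    errors = np.linalg.norm(pred - gt, axis=1)  # per-landmark L2 error
    return float(errors.mean() / norm_dist)

def failure_rate(nmes, threshold=0.10):
    """Fraction of faces whose NME exceeds the threshold (10% here)."""
    return float((np.asarray(nmes) > threshold).mean())
```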
30. 4. Experiments : Dense Facial Landmark Accuracy
31. 4. Experiments
4.6. Dense Facial Landmark Accuracy
32. 4. Experiments : Face Recognition Accuracy
33. 4. Experiments
4.7. Face Recognition Accuracy
34. 4. Experiments : Inference Efficiency
35. 4. Experiments
4.8. Inference Efficiency
https://github.com/deepinsight/insightface/tree/master/RetinaFace
36. 4. Experiments
4.8. Inference Efficiency
Yoo, YoungJoon, Dongyoon Han, and Sangdoo Yun. "EXTD: Extremely Tiny Face Detector via Iterative Filter Reuse." arXiv preprint arXiv:1906.06579 (2019).
38. 5. Conclusion
WIDER FACE dataset (easy, medium, hard):
- Face detection: SOTA (AP 91.4% on the hard subset)
- Joint prediction: face box, five facial landmarks, dense 3D face reconstruction
- Lightweight backbone (MobileNet) -> real-time inference
IJB-C dataset:
- ArcFace (with RetinaFace): better verification accuracy from the extra supervision
Code is available at https://github.com/deepinsight/insightface
(MXNet)