These slides provide an overview of the most popular approaches to date for solving object detection with deep neural networks. They review both two-stage approaches such as R-CNN, Fast R-CNN and Faster R-CNN, and one-stage approaches such as YOLO and SSD. They also contain pointers to relevant datasets (Pascal VOC, COCO, ILSVRC, Open Images) and the definition of the Average Precision (AP) metric.
Full program:
https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgraduate-course-artificial-intelligence-deep-learning/
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcelona 2020
1. Object Detection
Computer Vision 2
Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Spring 2020
9. Object Detection
CAT, DOG, DUCK
The task of assigning a label and a bounding box to all objects in the image:
1. We don't know the number of objects.
2. Object detection relies on object proposal and object classification.
10. Object Detection as Classification
Classes = [cat, dog, duck]
Cat? NO
Dog? NO
Duck? NO
11. Object Detection as Classification
Classes = [cat, dog, duck]
Cat? NO
Dog? NO
Duck? NO
12. Object Detection as Classification
Classes = [cat, dog, duck]
Cat? YES
Dog? NO
Duck? NO
13. Object Detection as Classification
Classes = [cat, dog, duck]
Cat? NO
Dog? NO
Duck? NO
14. Object Detection as Classification
Challenge: a very large number of possibilities:
● position
● scale
● aspect ratio
Question: do you think it is feasible to evaluate all possibilities?
15. Object Detection as Classification
Challenge: a very large number of possibilities:
● position
● scale
● aspect ratio
Solution: if your classifier is fast enough, go for it.
16. Object Detection with ConvNets?
ConvNets are computationally demanding: we can't test all positions & scales!
Solution: look at a tiny subset of positions, and choose them wisely :)
21. Open Images Dataset
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., ... & Ferrari, V. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV 2020. [dataset]
22. Open Images Dataset v6
PASCAL: 20 categories, 6k training images, 6k validation images, 10k test images
ILSVRC: 200 categories, 456k training images, 60k validation + test images
COCO: 80 categories, 200k training images, 60k val + test images
23. Open Images Dataset v6
24. Open Images Dataset v6
Images with a large number of different classes annotated (11 on the left, 7 on the right).
26. Evaluation metrics: Intersection over Union (IoU)
● aka Jaccard index
● Size of the intersection divided by the size of the union
● Evaluates localization
Figure: PyImageSearch
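Not in the original slides, but as a minimal plain-Python sketch of the metric, assuming boxes in corner format (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over Union (Jaccard index) of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```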
27. Metric: Average Precision (AP) for Object Detection
Consider the case in which your object detection algorithm provides you with:
● Coordinates for each bounding box.
● A confidence score for each bounding box.
[Figure: three predicted boxes with confidence scores 0.9, 0.7 and 0.5]
28. Metric: Average Precision (AP) for Object Detection
Rank your predictions based on the confidence score of your object detection algorithm:
[Figure: the three predictions ranked #1 (0.9), #2 (0.7) and #3 (0.5)]
29. Metric: Average Precision (AP) for Object Detection
Set a criterion to identify whether your predictions are correct. Typically, a minimum IoU with respect to the bounding boxes of the ground-truth annotation:
○ For example, IoU > 0.5, referred to as AP0.5.
○ Other popular options: AP0.75, or a range of IoUs [0.5:0.95] in 0.05 steps.
○ Each GT box can only be assigned to one predicted box.
[Figure: ranked predictions (#1: 0.9, #2: 0.7, #3: 0.5) matched against the ground truth and labelled True Positive (TP) or False Positive (FP) by confidence score]
30. Metric: Average Precision (AP) for Object Detection
Compute the points of the Precision-Recall curve by taking as decision thresholds (Thr) the confidence scores of the ranked detections.
[Figure: ranked predictions (0.9, 0.7, 0.5) vs. ground truth, labelled True Positive (TP), False Positive (FP) or False Negative (FN)]
Rank | Correct?
1 | True
2 | False
3 | True
Threshold | Precision | Recall
0.9 | 1/1 | 1/4
0.7 | 1/2 | 1/4
0.5 | 2/3 | 2/4
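A small Python sketch (ours, not from the slides) that reproduces the table above from the ranked correctness flags and the number of ground-truth boxes:

```python
def pr_points(ranked_correct, num_gt):
    """(precision, recall) at each rank, i.e. at each confidence threshold."""
    points, tp = [], 0
    for rank, correct in enumerate(ranked_correct, start=1):
        tp += int(correct)
        points.append((tp / rank, tp / num_gt))
    return points

# Worked example from the slide: 3 ranked detections, 4 ground-truth boxes.
print(pr_points([True, False, True], num_gt=4))
# [(1.0, 0.25), (0.5, 0.25), (0.666..., 0.5)]
```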
31. Metric: Average Precision (AP) for Object Detection
In the object detection case, some GT objects may never be matched by any prediction. We may consider that trying to find these missing objects with an infinite number of object proposals would drop precision to ⋍0, but would eventually find all objects, so recall would reach 1.
[Figure: ranked predictions (0.9, 0.7, 0.5) vs. ground truth, labelled True Positive (TP), False Positive (FP) or False Negative (FN)]
Rank | Correct?
1 | True
2 | False
3 | True
∞ | True(s)
Threshold | Precision | Recall
0.9 | 1/1 | 1/4
0.7 | 1/2 | 1/4
0.5 | 2/3 | 2/4
0.0 | ⋍ 0 | 1
Table inspired by: Jonathan Hui, "mAP (mean Average Precision) for Object Detection" (Medium 2018)
33. Metric: Average Precision (AP) for Object Detection
"The precision at each recall level r is interpolated by taking the maximum precision (...) for which the corresponding recall exceeds r." (from Pascal VOC) [ref]
Rank | Correct?
1 | True
2 | False
3 | True
∞ | True(s)
Threshold | Precision | Recall
0.9 | 1/1 | 1/4
0.7 | 1/2 | 1/4
0.5 | 2/3 | 2/4
0.0 | ⋍ 0 | 1
[Figure: the interpolated Precision-Recall curve of the worked example]
[ref] Everingham, Mark, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. "The Pascal Visual Object Classes (VOC) challenge." IJCV 2010.
34. Metric: Average Precision (AP) for Object Detection
Actually, not all PR pairs need to be computed, because AP for object detection only requires the PR pairs related to True Positives:
Rank | Correct?
1 | True
2 | False
3 | True
∞ | True(s)
Threshold | Precision | Recall
0.9 | 1/1 | 1/4
0.7 | 1/2 | 1/4
0.5 | 2/3 | 2/4
0.0 | ⋍ 0 | 1
[Figure: the Precision-Recall curve keeping only the points at True Positives]
35. Metric: Average Precision (AP) for Object Detection
● The AP metric approximates the area under the PR curve.
● There are different methods for this approximation, which may cause inconsistencies between implementations.
● Popular ones:
○ (suggested) "the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1]" [ref]
○ "weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight" (scikit-learn).
[ref] Everingham, Mark, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. "The Pascal Visual Object Classes (VOC) challenge." IJCV 2010.
36. Metric: Average Precision (AP) for Object Detection
In our work, we adopt the approach from Pascal VOC:
● AP is "the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1]"
Threshold | Precision | Recall
0.9 | 1/1 | 1/4
0.5 | 2/3 | 2/4
0.0 | ⋍ 0 | 1
Recall | Precision
0.0 | 1.00
0.1 | 1.00
0.2 | 1.00
0.3 | 0.67
0.4 | 0.67
0.5 | 0.00
... | 0.00
1.0 | 0.00
AP | 0.39
[Figure: the PR curve with the eleven sampled recall levels]
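As a check of the 0.39 result, here is a small Python sketch (ours) of the eleven-point rule, using the strict "recall exceeds r" convention of the Pascal VOC quote above; pr_points are (precision, recall) pairs:

```python
def ap_11point(pr_points):
    """Pascal VOC 11-point AP: mean interpolated precision at recall
    levels 0.0, 0.1, ..., 1.0. Precision at level r is the maximum
    precision among points whose recall exceeds r."""
    ap = 0.0
    for i in range(11):
        r = i / 10.0
        precisions = [p for p, rec in pr_points if rec > r] or [0.0]
        ap += max(precisions) / 11.0
    return ap

# PR pairs from the slides, including the (precision ⋍ 0, recall = 1)
# limit point for an infinite list of proposals.
print(ap_11point([(1/1, 1/4), (1/2, 1/4), (2/3, 2/4), (0.0, 1.0)]))  # ≈ 0.39
```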
37. Metric: Average Precision w/o confidence scores
What if your object detection algorithm does not provide any confidence score?
[Figure: three predictions whose ranks #1, #2, #3 are unknown]
38. Metric: Average Precision w/o confidence scores
If your object detection algorithm does not provide any confidence score:
● Generate N random rankings (e.g. N = 10) and compute the metric for each of these N runs.
● Average the obtained APs: AP = (AP1 + AP2 + ... + APN) / N.
[Figure: N random rankings (#1, #2, #3) of the predictions, each producing an AP1 ... APN that are averaged into the final AP]
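A sketch of this procedure, reusing the pr_points() and ap_11point() helpers from the earlier sketches:

```python
import random

def ap_without_scores(correct_flags, num_gt, n_runs=10):
    """Average AP over N random orderings of unscored detections."""
    total = 0.0
    for _ in range(n_runs):
        flags = list(correct_flags)
        random.shuffle(flags)                          # one random ranking
        pts = pr_points(flags, num_gt) + [(0.0, 1.0)]  # add the recall = 1 limit point
        total += ap_11point(pts)
    return total / n_runs
```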
39. Evaluation metrics: mean Average Precision (mAP)
In the case of Q multiple classes (e.g. car, bike, person, ...), the mAP averages the AP(q) of each class:
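In symbols, with AP(q) the average precision of class q:

mAP = (1/Q) · Σ_{q=1..Q} AP(q)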
● Further reading:
○ Tarang Shah, "Measuring Object Detection models — mAP — What is Mean Average Precision?" (Medium 2018)
40. Evaluation metrics: Average Precision (AP)
You can obtain implementations of this Average Precision for Object Detection from:
● TensorFlow
● Microsoft COCO dataset API
43. Object Detection
There are two main families:
● Two-stage: region proposal and then classification
● Single-stage: a grid on the image where each cell is a proposal
44. Region Proposals
● Find "blobby" image regions that are likely to contain objects
● "Class-agnostic" object detector
Slide Credit: CS231n
45. Region Proposals
Typical object detection/segmentation pipeline:
[Figure: image → object proposal → refinement and classification → scored detections (Dog 0.85, Cat 0.80, Dog 0.75, Cat 0.90)]
46. Region Proposals
Typical object detection/segmentation pipeline:
[Figure: the same pipeline, with overlapping detections filtered by Non-Maximum Suppression]
NMS: Non-Maximum Suppression
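A greedy NMS sketch in plain Python (ours, not from the slides), reusing the iou() helper sketched after slide 26:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy Non-Maximum Suppression: repeatedly keep the highest-scoring
    box and drop the remaining boxes that overlap it too much.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[i], boxes[best]) < iou_threshold]
    return keep
```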
47. Region Proposals: from pixels
#SS Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. Selective search for object recognition. IJCV 2013.
48. Region Proposals: from pixels
#MCG Pont-Tuset, J., Arbelaez, P., Barron, J. T., Marques, F., & Malik, J. Multiscale combinatorial grouping for image segmentation and object proposal generation. TPAMI 2016.
49. R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
51. R-CNN + Non-Maximum Suppression (NMS)
#DPM Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. Object detection with discriminatively trained part-based models. TPAMI 2009.
Figure: Adrian Rosebrock
52. R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
53. R-CNN: Problems
1. Slow at test time: need to run a full forward pass of the CNN for each region proposal.
2. SVMs and regressors are post-hoc: CNN features are not updated in response to the SVMs and regressors.
Slide Credit: CS231n
54. Fast R-CNN
R-CNN Problem #1: slow at test time, needing a full forward pass of the CNN for each region proposal.
Solution: share the computation of the convolutional layers between the region proposals of an image.
Girshick. Fast R-CNN. ICCV 2015.
55. Fast R-CNN
R-CNN Problems #2 & #3: SVMs and regressors are post-hoc. Complex training.
Solution: train it all together, end to end.
● Softmax over (K+1) classes and 4 box offsets.
● Positive boxes are the ones with the largest Intersection over Union with the ground truth.
Girshick. Fast R-CNN. ICCV 2015.
56. Fast R-CNN: RoI-Pooling
Hi-res input image (3 x 800 x 600, with region proposal)
→ convolution and pooling →
Hi-res conv features (C x H x W, with a region proposal of variable size)
→ max-pool within each grid cell →
RoI conv features (C x h x w for the region proposal, fixed size), since the fully-connected layers expect low-res conv features of size C x h x w.
Slide Credit: CS231n. Girshick. Fast R-CNN. ICCV 2015.
57. Fast R-CNN: RoI-Pooling
RoI pooling allows 1) propagating the gradient only over the regions of interest, and 2) efficient computation.
Input: convolutional map + N regions of interest
Output: tensor of N x 7 x 7 x depth features
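A minimal sketch of this operation using torchvision.ops.roi_pool; the tensor sizes are illustrative assumptions matching the 800x600 image and VGG-16 stride of 16 from the slide above:

```python
import torch
from torchvision.ops import roi_pool

# One feature map of depth 256 on a 38 x 50 grid (e.g. Conv5_3 output).
features = torch.randn(1, 256, 38, 50)

# N = 2 regions of interest as (batch_index, x1, y1, x2, y2),
# in the coordinate frame of the input image.
rois = torch.tensor([[0.,  40.,  40., 400., 300.],
                     [0., 200., 100., 600., 440.]])

# spatial_scale maps image coordinates onto the feature map (1/16 for
# VGG-16). The output is N x 256 x 7 x 7: a fixed size regardless of
# each region's shape.
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```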
58. Fast R-CNN
Using a VGG-16 CNN on the Pascal VOC 2007 dataset:
Training time: 84 hours (R-CNN) vs. 9.5 hours (Fast R-CNN), an 8.8x speedup. Faster!
Test time per image: 47 seconds vs. 0.32 seconds, a 146x speedup. FASTER!
mAP (VOC 2007): 66.0 vs. 66.9. Better!
Slide Credit: CS231n
59. Fast R-CNN: Limitation
Test-time speeds do not include region proposals:
Test time per image: 47 seconds (R-CNN) vs. 0.32 seconds (Fast R-CNN), a 146x speedup.
Test time per image with Selective Search: 50 seconds vs. 2 seconds, only a 25x speedup.
Slide Credit: CS231n
60. Faster R-CNN
Learn proposals end-to-end, sharing parameters with the classification network.
[Figure: conv layers up to Conv5_3 feed both a Region Proposal Network and the Fast R-CNN head (RoI pooling over the RPN proposals, FC6, FC7, FC8, class probabilities)]
#Faster R-CNN Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.
61. Faster R-CNN
Learn proposals end-to-end, sharing parameters with the classification network.
This network is called Region Proposal Network (RPN), and the proposals are learnt!!
[Figure: the same architecture, highlighting the Region Proposal Network]
#Faster R-CNN Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.
67. Two-stage vs Single-stage methods
Two-stage methods are computationally too intensive and too slow for real-time applications: Faster R-CNN runs at 7 FPS.
[Figure: two-stage pipeline, from image pixels through object proposal generation, resampling pixels and features for each bounding box, to a high-quality classifier]
68. Two-stage vs Single-stage methods
Instead of having two networks (a Region Proposal Network + a classifier network), one-stage architectures predict bounding boxes and confidences for multiple categories directly with a single network.
[Figure: the two-stage pipeline collapsed into a single network]
73. One-stage methods
Previously, the problem was that there were too many positions & scales to test.
Modern detectors parallelize feature extraction across all locations: region classification is not slow anymore!
74. YOLO: You Only Look Once
#YOLO Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016.
Proposal-free object detection pipeline: an S x S grid is placed on the input.
For each cell of the S x S grid, predict:
● B boxes and confidence scores C (5 x B values) + class probabilities c
75. YOLO: You Only Look Once
Proposal-free object detection pipeline.
[Figure: S x S grid on input → bounding boxes + confidence and class probability map → final detections]
Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016.
76. YOLO: You Only Look Once
Proposal-free object detection pipeline.
[Figure: S x S grid on input → bounding boxes + confidence and class probability map → final detections]
Final detections: Cj * prob(c) > threshold
Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016.
77. YOLO: You Only Look Once
Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016.
78. YOLO: You Only Look Once
Each cell predicts:
- For each bounding box:
  - 4 coordinates (x, y, w, h)
  - 1 confidence value
- Some number of class probabilities
For Pascal VOC:
- 7x7 grid
- 2 bounding boxes / cell
- 20 classes
7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
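The output-size arithmetic from this slide, as a small Python check (the variable names are ours):

```python
S, B, num_classes = 7, 2, 20  # Pascal VOC configuration from the slide

# Each cell predicts B boxes x (4 coordinates + 1 confidence) + class scores.
per_cell = B * 5 + num_classes   # 30
outputs = S * S * per_cell       # 7 x 7 x 30 = 1470
print(per_cell, outputs)         # 30 1470
```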
79. SSD: Single Shot MultiBox Detector
Liu et al. SSD: Single Shot MultiBox Detector. ECCV 2016.
Same idea as YOLO, plus several predictors at different stages of the network to allow for different receptive fields.
85. RetinaNet
Matching proposal-based performance with a one-stage approach.
Problem of one-stage detectors? They evaluate many candidate locations but only a few contain objects → IMBALANCE, making learning inefficient.
Focal loss: the key idea is to lower the loss weight of well-classified samples and increase it for difficult ones.
Lin et al. Focal Loss for Dense Object Detection. ICCV 2017.
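A minimal PyTorch sketch of the binary focal loss with α-balancing (our own illustration of the idea, not the RetinaNet reference code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    `targets` are float 0/1 labels. The (1 - p_t)**gamma factor shrinks
    the loss of well-classified samples, so the many easy background
    locations no longer dominate training.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```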
88. Software implementations
Most models are publicly available, ready to be used off-the-shelf.
Faster R-CNN: [torchvision] (suggested), [Detectron2], [Keras]
RetinaNet: [Detectron2] (suggested), [Keras]
Benchmark: [TensorFlow Object Detection API]
YOLOv3: [PyTorch]
SSD: [PyTorch], [Tutorial on Keras]
Mask R-CNN: [torchvision] (suggested), [PyTorch], [Keras & TF], [tutorial]
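For example, the suggested torchvision Faster R-CNN can be run off-the-shelf in a few lines (a sketch, with a random tensor standing in for a real image):

```python
import torch
import torchvision

# Faster R-CNN pre-trained on COCO, straight from torchvision.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 600, 800)     # stand-in for an RGB image scaled to [0, 1]
with torch.no_grad():
    predictions = model([image])    # one dict per input image
print(predictions[0].keys())        # dict with 'boxes', 'labels', 'scores'
```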
89. Software implementations
You will probably not be interested in the object classes defined in Pascal/COCO. You can adapt (fine-tune) existing models to your own object classes.
Wang, Xin, Thomas E. Huang, Trevor Darrell, Joseph E. Gonzalez, and Fisher Yu. "Frustratingly Simple Few-Shot Object Detection." arXiv preprint arXiv:2003.06957 (2020). [code based on Detectron2]
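With torchvision, adapting the classes amounts to swapping the box-predictor head (a sketch following the standard torchvision fine-tuning recipe; num_classes is an assumed example):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 3  # assumed example: 2 classes of your own + background

# Start from the COCO-pretrained detector...
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# ...and replace only its classification head; the backbone and the RPN
# keep their pre-trained weights and are fine-tuned on your data.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```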
90. Software implementations for Mobile
● TensorFlow Lite: Object Detection
● PyTorch Mobile (no specific solutions for object detection)