1. R-FCN: Object Detection via Region-based Fully Convolutional Networks
Paper: https://arxiv.org/abs/1605.06409
Authors:
Jifeng Dai, Microsoft Research
Yi Li∗, Tsinghua University
Kaiming He, Microsoft Research
Jian Sun, Microsoft Research
Presented By:
Ashish
2. R-CNN based detectors
● Fast R-CNN / Faster R-CNN:
○ Fast R-CNN computes the feature maps from the whole image once.
○ Region proposals (ROIs) are then mapped onto these shared feature maps; Faster R-CNN additionally generates the proposals from the feature maps themselves via a Region Proposal Network (RPN).
○ No further per-ROI feature extraction is needed (roughly 2,000 ROIs per image).
● Both process object detection in two stages:
○ generate region proposals (ROIs), and
○ make classification and localization (bounding box) predictions for each ROI.
3. R-FCN
● R-FCN improves speed by reducing the amount of work needed per ROI (see the toy sketch below).
● The region-based (position-sensitive) feature maps are independent of the ROIs, so they can be computed once for the whole image, outside each ROI.
● As a result, R-FCN is faster than Fast R-CNN or Faster R-CNN.
● R-FCN stands for Region-based Fully Convolutional Network.
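The contrast with the per-ROI head of slide 2 can be made concrete with a toy PyTorch sketch (stand-in modules and shapes assumed for illustration; this is not either real detector):
```python
import torch
import torch.nn as nn

# Stand-in modules only: the point is where the per-ROI work sits in each design.
shared_body = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # shared conv body (toy)
per_roi_head = nn.Linear(8 * 7 * 7, 21)                  # Faster R-CNN-style per-ROI subnetwork

image = torch.randn(1, 3, 64, 64)
feats = shared_body(image)                               # computed once per image
rois = [feats[..., 0:7, 0:7], feats[..., 7:14, 7:14]]    # toy "pooled" ROI crops
scores = [per_roi_head(r.reshape(-1)) for r in rois]     # this head repeats for EVERY ROI

# R-FCN removes this repeated head: the whole network runs once per image to
# produce position-sensitive score maps, and each ROI then needs only cheap
# pooling/averaging (see the PS-ROI pooling sketch later in this deck).
```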
4. Backbone Architecture
1. R-FCN in this paper is based on ResNet-101.
2. ResNet-101 has 100 convolutional layers followed by global average pooling and a 1000-class fc layer.
3. We remove the average pooling layer and the fc layer and only use the convolutional layers to compute feature maps.
4. We use ResNet-101 pre-trained on ImageNet.
5. The last convolutional block in ResNet-101 is 2048-d; we attach a randomly initialized 1024-d 1×1 convolutional layer to reduce the dimension.
6. Then we apply a k²(C + 1)-channel convolutional layer to generate the position-sensitive score maps, as introduced next (a minimal sketch of this head follows below).
7. Region-based, fully convolutional network for accurate and efficient object detection.
8. Position Sensitive Score Maps
● Each score map detects (scores) one specific sub-region of the object, e.g., its top-left part (see the pooling sketch below).
Source: https://medium.com/@jonathan_hui/understanding-region-based-fully-convolutional-networks-r-fcn-for-object-detection-828316f07c99
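The following is a minimal sketch of how one ROI reads these maps. The channel layout (bin (i, j) owning a contiguous block of C + 1 channels) is an assumption of this sketch; a batched, production version exists as torchvision.ops.ps_roi_pool.
```python
import torch

def ps_roi_pool(score_maps, roi, k, C1):
    """Position-sensitive ROI pooling for ONE ROI (minimal sketch).
    score_maps: [k*k*C1, H, W] tensor, C1 = C + 1 classes incl. background;
    roi: (x0, y0, x1, y1) in feature-map coordinates."""
    x0, y0, x1, y1 = roi
    bin_h, bin_w = (y1 - y0) / k, (x1 - x0) / k
    votes = torch.zeros(C1, k, k)
    for i in range(k):       # grid row
        for j in range(k):   # grid column
            # Bin (i, j) average-pools ONLY its own dedicated C1 score maps.
            maps = score_maps[(i * k + j) * C1:(i * k + j + 1) * C1]
            ys, xs = int(y0 + i * bin_h), int(x0 + j * bin_w)
            ye = max(ys + 1, int(y0 + (i + 1) * bin_h))
            xe = max(xs + 1, int(x0 + (j + 1) * bin_w))
            votes[:, i, j] = maps[:, ys:ye, xs:xe].mean(dim=(1, 2))
    return votes.mean(dim=(1, 2))  # vote over the k*k bins -> one score per class

maps = torch.randn(9 * 21, 19, 25)               # k = 3, C = 20 -> 189 maps
print(ps_roi_pool(maps, (2, 3, 14, 12), 3, 21))  # 21 class scores for this ROI
```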
10. Class Score
Let’s say we have C classes to detect.
We expand this to C + 1 classes to include a class for the background (non-object).
Each class has its own 3 × 3 set of score maps, so there are (C + 1) × 3 × 3 score maps in total (e.g., for PASCAL VOC with C = 20, that is 21 × 9 = 189 maps).
Source: https://medium.com/@jonathan_hui/understanding-region-based-fully-convolutional-networks-r-fcn-for-object-detection-828316f07c99
11. Classification
Using its own set of score maps, we pool one score per class for each ROI (position-sensitive ROI pooling followed by averaging, i.e., voting over, the k × k bins).
We then apply a softmax over these C + 1 scores to compute the probability of each class, as in the toy example below.
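A toy example of this last step (values are made up; C = 3 classes plus background):
```python
import torch
import torch.nn.functional as F

# Suppose PS-ROI pooling and voting produced one score per class for a given ROI.
votes = torch.tensor([2.1, 0.3, -1.0, 0.5])  # [C + 1] per-class scores
probs = F.softmax(votes, dim=0)              # per-class probabilities, sum to 1
print(probs)                                 # the first class dominates here
```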
13. Training
1. With pre-computed region proposals, it is easy to train the R-FCN architecture end-to-end.
2. The loss function defined on each RoI is the summation of the cross-entropy loss and the box regression loss (see the sketch after this list).
3. RoIs are labeled positive if they have an intersection-over-union (IoU) overlap of at least 0.5 with a ground-truth box, and negative otherwise.
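A minimal PyTorch sketch of this per-RoI loss (toy shapes and a class-agnostic 4-d box regressor assumed; the balance weight λ = 1 as in the paper):
```python
import torch
import torch.nn.functional as F

def rfcn_roi_loss(cls_scores, bbox_preds, labels, bbox_targets, lam=1.0):
    """Cross-entropy over the (C+1) classes plus smooth-L1 box regression;
    the regression term is active only for positive (label > 0) RoIs."""
    cls_loss = F.cross_entropy(cls_scores, labels)        # L_cls
    pos = labels > 0                                      # 0 = background
    reg_loss = cls_scores.new_zeros(())                   # scalar zero by default
    if pos.any():
        reg_loss = F.smooth_l1_loss(bbox_preds[pos], bbox_targets[pos])  # L_reg
    return cls_loss + lam * reg_loss

# Toy shapes: 4 RoIs, C = 20 classes (+1 background), 4-d box targets.
scores = torch.randn(4, 21)
boxes = torch.randn(4, 4)
labels = torch.tensor([0, 5, 0, 12])   # 0 = background (IoU < 0.5)
targets = torch.randn(4, 4)
print(rfcn_roi_loss(scores, boxes, labels, targets))
```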
R-FCN uses online hard example mining (OHEM) [22] during training. OHEM is an online bootstrapping algorithm for training region-based ConvNet object detectors such as Fast R-CNN.
The negligible per-RoI computation of R-FCN makes example mining nearly cost-free.
14. Training
1. Assuming N proposals per image, in the forward pass we evaluate the loss of all N proposals.
2. We then sort all RoIs (positive and negative) by loss and select the B RoIs with the highest loss (see the OHEM sketch after this list).
3. Backpropagation is performed on the selected examples.
4. A weight decay of 0.0005 and a momentum of 0.9 are used.
5. By default, single-scale training is used: images are resized such that the scale (shorter side of the image) is 600 pixels.
6. Each GPU holds 1 image and selects B = 128 RoIs for backpropagation.
7. We train the model with 8 GPUs.
8. We fine-tune R-FCN with a learning rate of 0.001 for 20k mini-batches and 0.0001 for the next 10k mini-batches on VOC.
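A minimal sketch of the OHEM selection step described in items 1–3 (PyTorch assumed):
```python
import torch

def ohem_select(per_roi_losses, B=128):
    """Evaluate the loss of all N proposals, then keep the B RoIs with the
    highest loss for backpropagation (online hard example mining)."""
    B = min(B, per_roi_losses.numel())
    _, hard_idx = torch.topk(per_roi_losses, B)  # hardest examples first
    return hard_idx

per_roi_losses = torch.rand(300)     # e.g., N = 300 proposals for one image
hard = ohem_select(per_roi_losses)   # B = 128 RoIs used to form the backprop loss
print(hard.shape)                    # torch.Size([128])
```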
16. Inference
1. The feature maps shared between the RPN and R-FCN are computed once (on an image with a single scale of 600).
2. The RPN part then proposes RoIs, on which the R-FCN part evaluates category-wise scores and regresses bounding boxes.
3. During inference we evaluate 300 RoIs per image.
4. The results are post-processed by non-maximum suppression (NMS) with a threshold of 0.3 IoU (see the sketch below).
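A post-processing sketch for one class using torchvision's NMS (random boxes stand in for real detections; in practice NMS is applied per class):
```python
import torch
from torchvision.ops import nms

boxes = torch.rand(300, 4) * 300         # the 300 evaluated RoIs (toy values)
boxes[:, 2:] += boxes[:, :2] + 1         # ensure (x1, y1, x2, y2) with x2 > x1, y2 > y1
scores = torch.rand(300)
keep = nms(boxes, scores, iou_threshold=0.3)  # 0.3 IoU threshold, as in the paper
print(boxes[keep].shape)                 # the surviving detections
```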