Faster R-CNN

Faster R-CNN integrates the region proposal algorithm into the CNN model. It constructs a single, unified model composed of an RPN (region proposal network) and Fast R-CNN, which share convolutional feature layers.

1. Model comparison

R-CNN:
- Use selective search to find ~2K region proposals
- For each proposed region, run a CNN to extract features for object detection
Fast R-CNN:
- Use selective search to find ~2K region proposals
- Run a shared CNN once over the whole image to extract a feature map
- Pool the features of each proposed region to the same fixed size (RoI pooling) for object detection
Faster R-CNN:
- Use a Region Proposal Network to find potential object regions
- Run Fast R-CNN on those regions for object detection

2. Region Proposal Network (RPN)

The image passes through the conv layers, which extract feature maps.
A small sliding window is moved over every location of the feature map.
At each location, k (k = 9) anchor boxes are used (3 scales of 128, 256 and 512, and 3 aspect ratios of 1:1, 1:2 and 2:1) to generate region proposals.
1. cls layer: outputs 2k (2k = 18) scores estimating whether each of the k boxes contains an object or not.
2. reg layer: outputs 4k (4k = 36) values encoding the coordinates (box center coordinates, width and height) of the k boxes.
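
The anchor layout at one sliding-window location can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `make_anchors`, the `(x1, y1, x2, y2)` box format, and the choice to keep each anchor's area close to `scale**2` are assumptions made for this example.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return len(scales) * len(ratios) anchors centred at (cx, cy).

    Each anchor is (x1, y1, x2, y2). `ratio` is read as height / width,
    and every anchor keeps an area of scale**2 (assumption of this sketch).
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(300, 300)
k = len(anchors)
# k anchors -> 2k objectness scores (cls layer) and 4k box values (reg layer)
print(k, 2 * k, 4 * k)  # 9 18 36
```
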
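For a $W\times H$ feature map, there are $W\times H\times k$ anchors in total (roughly 20,000 for a typical image in the paper).
1. Ignore cross-boundary anchors during training -> ~6,000 left per image
2. Apply Non-Max Suppression to the proposals, ranked by their cls scores -> ~2,000 left per image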
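
The two filtering steps above can be sketched in plain Python. This is a toy example under stated assumptions: the helper names (`iou`, `inside_image`, `nms`), the 60 x 40 feature-map size, the 400 x 400 image, and the four hand-written boxes are all illustrative, and the 0.7 NMS threshold is used as an assumed default.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def inside_image(box, img_w, img_h):
    """True if the box does not cross the image boundary."""
    x1, y1, x2, y2 = box
    return x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# W x H x k anchors for an illustrative 60 x 40 feature map with k = 9
feature_w, feature_h, k = 60, 40, 9
total_anchors = feature_w * feature_h * k
print(total_anchors)  # 21600

boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300), (-10, 0, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.95]

# Step 1: drop cross-boundary boxes. Step 2: NMS on what is left.
inside = [i for i, b in enumerate(boxes) if inside_image(b, 400, 400)]
kept = nms([boxes[i] for i in inside], [scores[i] for i in inside])
proposals = [boxes[inside[i]] for i in kept]
print(proposals)  # [(0, 0, 100, 100), (200, 200, 300, 300)]
```

The last box is dropped because it crosses the left image border, and the second box is suppressed because it overlaps the higher-scoring first box with an IoU of about 0.82.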

The loss function of RPN: $L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda\frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i, t_i^*)$

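$L_{cls}$ : loss function for classification, a log loss over two classes (object vs. not object), which can be implemented as a two-class softmax loss
- $p_i$ : predicted probability that anchor $i$ is an object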
- $p_i^*=\begin{cases} 1 & for \; positive \; anchor \\ 0 & for \; negative \; anchor \end{cases}$
- $N_{cls}$ : number of anchors in a mini-batch (256 in the paper)
  - The mini-batch is randomly sampled from the anchors of a single image, with at most half of the samples positive.
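$L_{reg}$ : loss function for box regression, $L_{reg}(t_i, t_i^*) = smooth_{L_1}(t_i - t_i^*)$. The factor $p_i^*$ in the loss means this term is only active for positive anchors.
- $\lambda$ : balancing weight. In the paper, the best value among {0.1, 1, 10, 100} is 10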
- $N_{reg}$ : number of anchor locations (~2,400 in the paper)
- $t_i$ : predicted box $\{t_x, t_y, t_w, t_h\}$
- $t_i^*$ : ground truth box $\{t_x^*, t_y^*, t_w^*, t_h^*\}$
- $smooth_{L_1}(x) = \begin{cases} 0.5x^2 & if \; |x|<1 \\ |x|-0.5 & otherwise \end{cases}$
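
The terms defined above can be combined into a small numeric sketch of the RPN loss. It is a plain-Python illustration, not the training code of any framework: the function names and the list-based inputs are assumptions of this example, and `n_cls` and `n_reg` are passed in explicitly as normalizers.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def log_loss(p, p_star):
    """Two-class log loss; p is the predicted object probability."""
    eps = 1e-12  # guard against log(0)
    return -math.log(max(p, eps)) if p_star == 1 else -math.log(max(1 - p, eps))

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=10.0):
    """L = (1/N_cls) * sum_i L_cls + lam * (1/N_reg) * sum_i p_i* * L_reg."""
    cls_term = sum(log_loss(pi, ps) for pi, ps in zip(p, p_star)) / n_cls
    reg_term = sum(
        ps * sum(smooth_l1(a - b) for a, b in zip(ti, ts))
        for ps, ti, ts in zip(p_star, t, t_star)
    ) / n_reg
    return cls_term + lam * reg_term

# Two anchors: one positive (p* = 1) and one negative (p* = 0).
p = [0.5, 0.5]
p_star = [1, 0]
t = [(0.5, 0.0, 2.0, 0.0), (9.0, 9.0, 9.0, 9.0)]
t_star = [(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
loss = rpn_loss(p, p_star, t, t_star, n_cls=2, n_reg=2, lam=10.0)
print(round(loss, 4))
```

Because the second anchor is negative, its large box error is multiplied by $p_i^* = 0$ and does not contribute to the regression term.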

3. Architecture of Faster R-CNN

4. How to train a Faster R-CNN

Pre-train a CNN on an image classification task.
Proposer. Train (fine-tune) the RPN, initialized with the ImageNet pre-trained model.
- Positive samples have IoU > 0.7 with a ground-truth box, while negative samples have IoU < 0.3 with every ground-truth box
- Slide a small n x n spatial window over the conv feature map of the entire image.
- At the center of each sliding window, predict multiple regions of various scales and ratios simultaneously. Here, k = 9
Detector. Train (fine-tune) a separate Fast R-CNN object detection model using the proposals generated by the RPN from the previous step (conv layers not yet shared)
Proposer2. Use the Fast R-CNN network to initialize RPN training, fix the shared conv layers, and fine-tune only the layers unique to the RPN
Detector2. Keep the shared conv layers fixed and fine-tune the layers unique to Fast R-CNN (the FC layers)
Lastly, repeat Proposer2 and Detector2 alternately if needed
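
The IoU thresholds used in the Proposer step can be sketched as a labeling function. This is a simplified illustration: the names `iou` and `label_anchor` and the -1 "ignore" label are assumptions of this example, and it only applies the two thresholds listed above, without any additional assignment rules.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored during training)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1  # between the thresholds: contributes nothing to the loss

gt_boxes = [(0, 0, 100, 100)]
labels = [
    label_anchor((0, 0, 100, 90), gt_boxes),       # IoU 0.9 -> positive
    label_anchor((0, 0, 100, 60), gt_boxes),       # IoU 0.6 -> ignored
    label_anchor((200, 200, 300, 300), gt_boxes),  # IoU 0.0 -> negative
]
print(labels)  # [1, -1, 0]
```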
