Faster R-CNN
Faster R-CNN integrates the region proposal algorithm into the CNN model. It constructs a single, unified model composed of an RPN (region proposal network) and Fast R-CNN, which share convolutional feature layers.
1. Model comparison

R-CNN:
Use selective search to find ~2K region proposals
For each proposed region, run a CNN to extract features for object detection
Fast R-CNN:
Use selective search to find ~2K region proposals
Run one shared-weight CNN over the whole image to extract a feature map
Pool the features of each proposed region to the same fixed size (RoI pooling) for object detection
Faster R-CNN:
Use a Region Proposal Network to find potential object regions
Run Fast R-CNN on those regions for object detection
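To make the "same size" step of Fast R-CNN concrete, here is a minimal NumPy sketch of RoI max pooling, the operation the Fast R-CNN paper uses for it. The function name `roi_max_pool`, the `(H, W, C)` layout and the end-exclusive box convention are assumptions made for illustration, not part of these notes or of any library.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one region of a (H, W, C) feature map to a fixed grid.

    roi is (x1, y1, x2, y2) in feature-map coordinates, end-exclusive.
    (Hypothetical helper for illustration only.)
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2, :]
    out_h, out_w = output_size
    h, w, c = region.shape
    # Bin edges that split the region into out_h x out_w roughly equal cells.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.zeros((out_h, out_w, c), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Guarantee at least one row and column per cell for small regions.
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            pooled[i, j] = region[y_lo:y_hi, x_lo:x_hi].max(axis=(0, 1))
    return pooled
```

Regions of different sizes all come out with the same `output_size`, so they can be fed to the same fully connected layers.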
2. Region Proposal Network (RPN)
The image goes through the conv layers, which extract feature maps
Slide a small window over every location of the feature map
At each location, k (k = 9) anchor boxes are used (3 scales of 128, 256 and 512, and 3 aspect ratios of 1:1, 1:2 and 2:1) to generate region proposals.
cls layer: outputs 2k (2k = 18) scores estimating whether each of the k boxes contains an object or not.
reg layer: outputs 4k (4k = 36) values encoding the coordinates (box center coordinates, width and height) of the k boxes.
For a feature map of size W x H, there are W x H x k anchors in total (roughly 20,000 for a typical 1000 x 600 image)
Ignore cross-boundary anchors -> ~6,000 left
Apply Non-Max Suppression -> ~2,000 left
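The anchor steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the reference implementation: the function names are hypothetical, the feature stride of 16 and the NMS IoU threshold of 0.7 are values taken from the paper rather than from these notes, and placing anchor centers at the middle of each stride cell is one possible convention.

```python
import numpy as np

def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centered at (0, 0).

    Each row is (x1, y1, x2, y2). A ratio is height / width, and every
    anchor of a given scale has an area of about scale ** 2.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def all_anchors(feat_h, feat_w, stride=16):
    """Place the k base anchors at every feature-map location."""
    base = base_anchors()
    xs = (np.arange(feat_w) + 0.5) * stride  # anchor centers in image pixels
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (centers + base[None, :, :]).reshape(-1, 4)

def inside_image(anchors, img_h, img_w):
    """Boolean mask of anchors that do not cross the image boundary."""
    return ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
            (anchors[:, 2] <= img_w) & (anchors[:, 3] <= img_h))

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy non-maximum suppression; returns the indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Overlap of the current best box with every remaining box.
        iw = np.minimum(boxes[i, 2], boxes[rest, 2]) - np.maximum(boxes[i, 0], boxes[rest, 0])
        ih = np.minimum(boxes[i, 3], boxes[rest, 3]) - np.maximum(boxes[i, 1], boxes[rest, 1])
        inter = np.clip(iw, 0, None) * np.clip(ih, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]  # drop boxes that overlap too much
    return keep
```

For example, `all_anchors(40, 60)` returns 40 x 60 x 9 = 21,600 boxes, in line with the rough anchor total for a 1000 x 600 image; `inside_image` then discards the cross-boundary ones, and `nms` is applied to the scored proposals.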

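The loss function of RPN:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

$L_{cls}$: loss function for classification, which is a softmax (log) loss over two classes, object vs. not object
$p_i$: predicted probability that anchor $i$ is an object
$p_i^*$: ground-truth label of anchor $i$ (1 for a positive anchor, 0 for a negative one)
$N_{cls}$: number of anchors in the mini-batch (256 in the paper)
The mini-batch is randomly sampled from the anchors of a single image, with positives and negatives at a ratio of up to 1:1.
$L_{reg}$: loss function for box regression (smooth L1 in the paper); the factor $p_i^*$ means it only counts for positive anchors.
$\lambda$: constant balancing weight. From the paper, the best value among {0.1, 1, 10, 100} is 10
$N_{reg}$: number of anchor locations (~2,400 in the paper)
$t_i$: predicted box (parameterized coordinates)
$t_i^*$: ground truth box associated with a positive anchor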
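As a rough numerical illustration of this loss, the sketch below sums a two-class log loss over all sampled anchors, normalized by `n_cls`, and adds `lam` times a box regression loss over positive anchors only, normalized by `n_reg`. It assumes the smooth L1 regression loss used in the paper; the function names and argument layout are hypothetical and chosen for this example only.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss, applied element-wise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=10.0):
    """RPN multi-task loss for one mini-batch of anchors.

    p      : (N,)   predicted probability that each anchor is an object
    p_star : (N,)   ground-truth labels, 1 for positive and 0 for negative
    t      : (N, 4) predicted box parameters
    t_star : (N, 4) ground-truth box parameters
    """
    eps = 1e-12  # avoids log(0)
    # Log loss over the two classes (object vs. not object).
    l_cls = -(p_star * np.log(p + eps) + (1.0 - p_star) * np.log(1.0 - p + eps))
    # Regression loss per anchor, summed over the 4 box parameters.
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # Multiplying by p_star switches the regression term off for negatives.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg
```

With `lam=10`, the two terms are roughly equally weighted when `n_cls` is a few hundred and `n_reg` is a few thousand, which is the reason the paper gives for that default.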
3. Architecture of Faster R-CNN

4. How to train a Faster R-CNN
Pre-train a CNN on an image classification task.
Proposer. Train (fine-tune) the RPN, initialized from the ImageNet pre-trained model.
Positive samples are anchors with IoU > 0.7 against a ground-truth box, while negative samples have IoU < 0.3 against every ground-truth box
Slide a small n x n spatial window over the CONV feature map of the entire image.
At the center of each sliding window, we predict multiple regions of various scales and ratios simultaneously. In our case, k = 9
Detector. Train (fine-tune) a separate Fast R-CNN object detection model using the proposals generated by the RPN from the previous step (conv layers not yet shared)
Proposer2. Use the Fast R-CNN network to initialize RPN training, fix the shared CONV layers, and fine-tune only the layers unique to the RPN
Detector2. Keep the shared CONV layers fixed and fine-tune the unique layers (FC layers) of Fast R-CNN
Lastly, repeat Proposer2 and Detector2 alternately if needed
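The positive/negative rule used when training the RPN (IoU > 0.7 and IoU < 0.3) can be sketched as below. The function names and the label convention (1 positive, 0 negative, -1 ignored) are assumptions for illustration. The paper additionally marks the highest-IoU anchor of each ground-truth box as positive; that rule is left out here to keep the sketch short.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between (N, 4) anchors and (M, 4) ground-truth boxes."""
    a = anchors[:, None, :]   # (N, 1, 4)
    g = gt_boxes[None, :, :]  # (1, M, 4)
    iw = np.clip(np.minimum(a[..., 2], g[..., 2]) - np.maximum(a[..., 0], g[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], g[..., 3]) - np.maximum(a[..., 1], g[..., 1]), 0, None)
    inter = iw * ih
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor: 1 = positive, 0 = negative, -1 = ignored."""
    best = iou_matrix(anchors, gt_boxes).max(axis=1)  # best overlap per anchor
    labels = np.full(len(anchors), -1, dtype=int)
    labels[best < neg_thresh] = 0
    labels[best > pos_thresh] = 1
    return labels
```

Anchors whose best IoU falls between the two thresholds get -1 and do not contribute to the training objective.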