Faster R-CNN

Faster R-CNN integrates the region proposal algorithm into the CNN model. It constructs a single, unified model composed of an RPN (region proposal network) and Fast R-CNN, which share convolutional features.

1. Model comparison

Difference between R-CNN, Fast R-CNN, and Faster R-CNN
  • R-CNN:

    • Use selective search to find ~2K region proposals

    • For each proposed region, use a CNN to extract features for object detection

  • Fast R-CNN:

    • Use selective search to find ~2K region proposals

    • Run a shared-weight CNN once over the whole image, then use RoI pooling to bring the features of each proposed region to the same size

    • Use these fixed-size features for object detection

  • Faster R-CNN:

    • Use a Region Proposal Network to find potential object regions

    • Perform Fast R-CNN for object detection

2. Region Proposal Network (RPN)

  1. The image passes through the conv layers, which extract feature maps

  2. Slide a small window over every location of the feature map

  3. For each location, k (k = 9) anchor boxes are used (3 scales of 128, 256 and 512, and 3 aspect ratios of 1:1, 1:2, 2:1) for generating region proposals.

    1. cls layer: outputs 2k (2k = 18) scores, i.e. an object / not-object score pair for each of the k boxes.

    2. reg layer: outputs 4k (4k = 36) values, i.e. the coordinates (box center coordinates, width and height) of each of the k boxes.

  4. For a $W \times H$ feature map, there are $W \times H \times k$ anchors in total

    1. Ignore cross-boundary anchors -> ~6,000 left

    2. Apply Non-Max Suppression -> ~2,000 left
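
The anchor steps above can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation: the function names, the feature stride of 16, and the 960 x 640 image size are assumptions chosen for the example, and the ratio is taken to mean height / width with the anchor area held at scale squared.

```python
import numpy as np

def make_base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centred on (0, 0) as (x1, y1, x2, y2)."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = height / width (assumed); area stays s * s
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def make_all_anchors(feat_w, feat_h, stride=16):
    """Place the base anchors at every feature-map location: W * H * k boxes."""
    base = make_base_anchors()
    xs = (np.arange(feat_w) + 0.5) * stride  # anchor centres in image pixels
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base[None, :, :]).reshape(-1, 4)

def inside_image(anchors, img_w, img_h):
    """Boolean mask of anchors that do not cross the image boundary."""
    return (
        (anchors[:, 0] >= 0) & (anchors[:, 1] >= 0)
        & (anchors[:, 2] <= img_w) & (anchors[:, 3] <= img_h)
    )

# Hypothetical sizes: a 60 x 40 feature map for a 960 x 640 image at stride 16.
anchors = make_all_anchors(feat_w=60, feat_h=40)
keep = inside_image(anchors, img_w=960, img_h=640)
print(anchors.shape[0], int(keep.sum()))
```

With these sizes the total is 60 × 40 × 9 = 21,600 anchors; the number that survive the boundary check depends on the image size and stride chosen.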

Steps in RPN
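
The Non-Max Suppression step in the list above can be illustrated with a greedy NumPy sketch. The function name `nms` and the toy boxes are hypothetical; the default IoU threshold of 0.7 is taken from the value the paper reports for RPN proposals and should be treated as an assumption here.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2). Returns kept indices, best score first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with every remaining box.
        iw = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        ih = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop boxes that overlap the current best box too much.
        order = rest[iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [0, 0, 10, 9], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 0.9, so it is suppressed, while the distant third box is kept.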

The loss function of RPN: $L(p_i, t_i) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$

  • $L_{cls}$: loss function for classification, which is a softmax loss function

    • $p_i$: predicted probability

    • $p_i^* = \begin{cases} 1 & \text{for a positive anchor} \\ 0 & \text{for a negative anchor} \end{cases}$

    • $N_{cls}$: number of anchors in a mini-batch (256 in the paper)

      • In the paper, these anchors are randomly sampled from a single image, with a positive-to-negative ratio of up to 1:1.

  • $L_{reg}$: box regression loss. $L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L_1}(t_i - t_i^*)$

    • $\lambda$: constant balancing weight. From the paper, the best value among {0.1, 1, 10, 100} is 10

    • $N_{reg}$: number of anchor locations (~2,400 in the paper)

    • $t_i$: predicted box $\{t_x, t_y, t_w, t_h\}$

    • $t_i^*$: ground-truth box $\{t_x^*, t_y^*, t_w^*, t_h^*\}$

    • $\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$
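
The loss above can be written out as a short NumPy sketch. The function names and argument layout are hypothetical, the classification term is a two-class log loss on the object probability, and `n_reg` defaults to the mini-batch size purely as a simplifying assumption of this example.

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=10.0, n_reg=None):
    """p: (N,) predicted object probability; p_star: (N,) 0/1 labels; t, t_star: (N, 4) boxes."""
    n_cls = p.shape[0]
    n_reg = n_cls if n_reg is None else n_reg  # assumption: default to the mini-batch size
    eps = 1e-12  # avoids log(0)
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1)  # summed over (t_x, t_y, t_w, t_h)
    # The regression term only counts positive anchors (p_star = 1).
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

p = np.array([0.9, 0.1])
p_star = np.array([1.0, 0.0])
t = np.zeros((2, 4))
t_star = np.zeros((2, 4))
print(smooth_l1(np.array([0.5, -2.0])))  # [0.125 1.5]
print(rpn_loss(p, p_star, t, t_star))
```

Because the regression term is multiplied by $p_i^*$, box errors on negative anchors contribute nothing to the loss.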

3. Architecture of Faster R-CNN

Faster R-CNN = Fast R-CNN + RPN

4. How to train a Faster R-CNN

  • Pre-train a CNN network on image classification tasks.

  • Proposer. Train (fine-tune) the RPN, initialized from the ImageNet pre-trained model.

    • Positive samples are anchors with IoU > 0.7 with a ground-truth box, while negative samples have IoU < 0.3

    • Slide a small n x n spatial window over the CONV feature map of the entire image.

    • At the center of each sliding window, we predict multiple regions of various scales and ratios simultaneously. In our case, k = 9

  • Detector. Train (fine-tune) a separate Fast R-CNN object detection model using the proposals generated by the previous step RPN (Conv layers not yet shared)

  • Proposer2. Fix the shared CONV layers, use the Fast R-CNN network to initialize RPN training and only fine-tune unique layers of RPN

  • Detector2. Fix the shared CONV layers and fine-tune the unique layers (FC layers) of Fast R-CNN

  • Lastly, repeat Proposer2 and Detector2 alternately if needed
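
The IoU thresholds used to pick positive and negative samples in the Proposer step can be sketched as below. The function names are hypothetical, and marking anchors between the two thresholds with -1 (ignored during training) is a convention of this sketch. The paper additionally treats the highest-IoU anchor for each ground-truth box as positive, which is omitted here for brevity.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between (A, 4) anchors and (G, 4) ground-truth boxes, as (x1, y1, x2, y2)."""
    a = anchors[:, None, :]
    g = gt_boxes[None, :, :]
    iw = np.clip(np.minimum(a[..., 2], g[..., 2]) - np.maximum(a[..., 0], g[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], g[..., 3]) - np.maximum(a[..., 1], g[..., 1]), 0, None)
    inter = iw * ih
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return 1 for positive, 0 for negative, -1 for anchors ignored in training."""
    best = iou_matrix(anchors, gt_boxes).max(axis=1)  # best overlap per anchor
    labels = np.full(len(anchors), -1, dtype=int)
    labels[best < neg_thr] = 0
    labels[best > pos_thr] = 1
    return labels

gt = np.array([[0, 0, 10, 10]], dtype=float)
anchors = np.array([[0, 0, 10, 10], [0, 0, 10, 5], [20, 20, 30, 30]], dtype=float)
print(label_anchors(anchors, gt))  # [ 1 -1  0]
```

The middle anchor has IoU 0.5 with the ground-truth box, so it falls between the two thresholds and is left out of the loss.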
