Fast R-CNN

Fast R-CNN improved the R-CNN training procedure by unifying three independent models into one jointly trained framework and by increasing the amount of shared computation.

1. Model Workflow

How Fast R-CNN works:

  • Pre-train a convolutional neural network on image classification tasks.

  • Propose regions by selective search (~2k candidates per image).

  • Modify the pre-trained CNN:

    • Replace the last max pooling layer of the pre-trained CNN with a RoI pooling layer. The RoI pooling layer outputs fixed-length feature vectors of region proposals.

    • Replace the last fully connected layer and the last softmax layer (K classes) with a fully connected layer and softmax over K + 1 classes.

  • The model branches into two output layers (a code sketch of this head follows the list):

    • A softmax estimator of K + 1 classes (same as in R-CNN, +1 is the “background” class), outputting a discrete probability distribution per RoI.

    • A bounding-box regression model which predicts offsets relative to the original RoI for each of the K classes.
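
As a rough illustration, here is a minimal PyTorch-style sketch of such a head. The layer sizes, the backbone stride of 16, and all class/variable names are assumptions for this sketch, not the paper's exact configuration:

```python
import torch.nn as nn
import torchvision.ops as ops

class FastRCNNHead(nn.Module):
    """Two sibling output branches on top of RoI-pooled features.

    Hypothetical sizes: 512-channel backbone features, 7x7 RoI pooling
    output, and 4096-d fully connected layers (VGG-16-like).
    """
    def __init__(self, num_classes, in_channels=512, pool_size=7):
        super().__init__()
        self.pool_size = pool_size
        self.fc = nn.Sequential(
            nn.Linear(in_channels * pool_size * pool_size, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        # Branch 1: scores for K + 1 classes (class 0 = background),
        # turned into a probability distribution p by a softmax.
        self.cls_score = nn.Linear(4096, num_classes + 1)
        # Branch 2: one (x, y, w, h) offset per foreground class.
        self.bbox_pred = nn.Linear(4096, num_classes * 4)

    def forward(self, feature_map, rois):
        # rois: list of per-image (x1, y1, x2, y2) proposal boxes,
        # e.g. produced by selective search.
        pooled = ops.roi_pool(feature_map, rois,
                              output_size=(self.pool_size, self.pool_size),
                              spatial_scale=1.0 / 16)  # assumed backbone stride 16
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)
```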

2. RoI pooling layer

RoI pooling converts the features in a region of the image with size $h \times w$ into a small fixed window of size $H \times W$. The input region is divided into an $H \times W$ grid, where each sub-window has size roughly $h/H \times w/W$. Max pooling is then applied within each grid cell.
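
To make the gridding concrete, here is a minimal NumPy sketch of RoI max pooling for a single region that has already been cropped from the feature map (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def roi_max_pool(region, H=7, W=7):
    """Max-pool a (channels, h, w) feature-map region into (channels, H, W).

    Assumes h >= H and w >= W; bin boundaries follow the h/H x w/W
    sub-window scheme described above.
    """
    c, h, w = region.shape
    ys = np.linspace(0, h, H + 1).astype(int)   # row boundaries of the grid
    xs = np.linspace(0, w, W + 1).astype(int)   # column boundaries
    out = np.empty((c, H, W), dtype=region.dtype)
    for i in range(H):
        for j in range(W):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[:, i, j] = cell.max(axis=(1, 2))  # max pool within the cell
    return out

# Example: a 512 x 18 x 25 region becomes 512 x 7 x 7 regardless of h and w.
pooled = roi_max_pool(np.random.rand(512, 18, 25))
print(pooled.shape)  # (512, 7, 7)
```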

3. Multi-task loss function (classification + regression)

The overall loss function is

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{box}(t^u, v)$$

where

$$L_{cls}(p, u) = -\log p_u$$

$$L_{box}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L_1}(t^u_i - v_i)$$

$$[u \geq 1] = \begin{cases} 1 & \text{if } u \geq 1 \\ 0 & \text{otherwise} \end{cases}$$

and $\lambda$ is a hyperparameter that balances the two task losses.

  • $u$ --- True class label, $u \in \{0, 1, 2, \dots, K\}$; the catch-all background class has $u = 0$.

  • $p$ --- Discrete probability distribution (per RoI) over $K + 1$ classes: $p = (p_0, p_1, \dots, p_K)$, computed by a softmax over the $K + 1$ outputs of a fully connected layer.

  • $v$ --- True bounding box, $v = (v_x, v_y, v_w, v_h)$.

  • $t^u$ --- Predicted bounding box correction for class $u$, $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$.

The bounding box loss $L_{box}$ measures the difference between $t^u_i$ and $v_i$ using the robust loss function below, which is less sensitive to outliers than an $L_2$ loss.

$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
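
Putting the pieces together, a minimal PyTorch-style sketch of this multi-task loss could look as follows. The tensor names, shapes, and the per-batch averaging here are assumptions for illustration; $\lambda$ is simply passed in as an argument:

```python
import torch
import torch.nn.functional as F

def smooth_l1(x):
    # 0.5 x^2 if |x| < 1, else |x| - 0.5 (the robust loss above).
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def fast_rcnn_loss(class_logits, bbox_deltas, labels, bbox_targets, lam=1.0):
    """Multi-task loss over a batch of N RoIs.

    class_logits: (N, K+1) raw class scores, bbox_deltas: (N, K*4) offsets,
    labels: (N,) true classes u (0 = background), bbox_targets: (N, 4) = v.
    """
    # L_cls = -log p_u, i.e. softmax followed by negative log-likelihood.
    cls_loss = F.cross_entropy(class_logits, labels)

    # The indicator [u >= 1]: the box loss counts only for foreground RoIs.
    fg = labels > 0
    if fg.any():
        # Pick the offsets t^u predicted for each RoI's true class u
        # (row u - 1, since only the K foreground classes have box outputs).
        deltas = bbox_deltas.view(bbox_deltas.size(0), -1, 4)
        t_u = deltas[fg, labels[fg] - 1]
        box_loss = smooth_l1(t_u - bbox_targets[fg]).sum(dim=1).mean()
    else:
        box_loss = bbox_deltas.sum() * 0.0   # no foreground RoIs in this batch
    return cls_loss + lam * box_loss
```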

4. Speed Bottleneck

  • Fast R-CNN is much faster than R-CNN in both training and test time.

  • However, the improvement is not dramatic, because the region proposals are still generated separately by another model (selective search), and that step is very expensive.
