Fast R-CNN

Fast R-CNN improved the R-CNN training procedure by unifying three independent models into one jointly trained framework and by increasing the amount of shared computation.

1. Model Workflow

How Fast R-CNN works:

  • Pre-train a convolutional neural network on image classification tasks.

  • Propose regions by selective search (~2k candidates per image).

  • Modify the pre-trained CNN:

    • Replace the last max pooling layer of the pre-trained CNN with a RoI pooling layer. The RoI pooling layer outputs fixed-length feature vectors of region proposals.

    • Replace the last fully connected layer and the last softmax layer (K classes) with a fully connected layer and softmax over K + 1 classes.

  • The model branches into two output layers (a code sketch of this head follows the list):

    • A softmax estimator of K + 1 classes (same as in R-CNN, +1 is the “background” class), outputting a discrete probability distribution per RoI.

    • A bounding-box regression model which predicts offsets relative to the original RoI for each of the K classes.
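
As a rough illustration, here is a minimal PyTorch-style sketch of such a head. The layer sizes, the backbone stride of 16, and all class/variable names are assumptions for this sketch, not the paper's exact configuration:

```python
import torch.nn as nn
import torchvision.ops as ops

class FastRCNNHead(nn.Module):
    """Two sibling output branches on top of RoI-pooled features.

    Hypothetical sizes: 512-channel backbone features, 7x7 RoI pooling
    output, and 4096-d fully connected layers (VGG-16-like).
    """
    def __init__(self, num_classes, in_channels=512, pool_size=7):
        super().__init__()
        self.pool_size = pool_size
        self.fc = nn.Sequential(
            nn.Linear(in_channels * pool_size * pool_size, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        # Branch 1: scores for K + 1 classes (class 0 = background),
        # turned into a probability distribution p by a softmax.
        self.cls_score = nn.Linear(4096, num_classes + 1)
        # Branch 2: one (x, y, w, h) offset per foreground class.
        self.bbox_pred = nn.Linear(4096, num_classes * 4)

    def forward(self, feature_map, rois):
        # rois: list of per-image (x1, y1, x2, y2) proposal boxes,
        # e.g. produced by selective search.
        pooled = ops.roi_pool(feature_map, rois,
                              output_size=(self.pool_size, self.pool_size),
                              spatial_scale=1.0 / 16)  # assumed backbone stride 16
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)
```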

2. RoI pooling layer

RoI pooling converts the features in a region of the image with size $h \times w$ into a small fixed window of size $H \times W$. The input region is divided into an $H \times W$ grid, where each sub-window has size roughly $h/H \times w/W$. Max pooling is then applied within each grid cell.
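
To make the gridding concrete, here is a minimal NumPy sketch of RoI max pooling for a single region that has already been cropped from the feature map (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def roi_max_pool(region, H=7, W=7):
    """Max-pool a (channels, h, w) feature-map region into (channels, H, W).

    Assumes h >= H and w >= W; bin boundaries follow the h/H x w/W
    sub-window scheme described above.
    """
    c, h, w = region.shape
    ys = np.linspace(0, h, H + 1).astype(int)   # row boundaries of the grid
    xs = np.linspace(0, w, W + 1).astype(int)   # column boundaries
    out = np.empty((c, H, W), dtype=region.dtype)
    for i in range(H):
        for j in range(W):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[:, i, j] = cell.max(axis=(1, 2))  # max pool within the cell
    return out

# Example: a 512 x 18 x 25 region becomes 512 x 7 x 7 regardless of h and w.
pooled = roi_max_pool(np.random.rand(512, 18, 25))
print(pooled.shape)  # (512, 7, 7)
```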

3. Multi-task loss function (classification + regression)

The overall loss function is

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{box}(t^u, v)$$

where

$$L_{cls}(p, u) = -\log p_u$$

$$L_{box}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L_1}(t^u_i - v_i)$$

$$[u \geq 1] = \begin{cases} 1 & \text{if } u \geq 1 \\ 0 & \text{otherwise} \end{cases}$$

and $\lambda$ is a hyperparameter that balances the two task losses.

  • $u$ --- True class label, $u \in \{0, 1, 2, \dots, K\}$; the catch-all background class has $u = 0$.

  • $p$ --- Discrete probability distribution (per RoI) over $K + 1$ classes: $p = (p_0, p_1, \dots, p_K)$, computed by a softmax over the $K + 1$ outputs of a fully connected layer.

  • $v$ --- True bounding box, $v = (v_x, v_y, v_w, v_h)$.

  • $t^u$ --- Predicted bounding box correction for class $u$, $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$.

The bounding box loss $L_{box}$ measures the difference between $t^u_i$ and $v_i$ using the robust loss function below, which is less sensitive to outliers than an $L_2$ loss.

$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
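
Putting the pieces together, a minimal PyTorch-style sketch of this multi-task loss could look as follows. The tensor names, shapes, and the per-batch averaging here are assumptions for illustration; $\lambda$ is simply passed in as an argument:

```python
import torch
import torch.nn.functional as F

def smooth_l1(x):
    # 0.5 x^2 if |x| < 1, else |x| - 0.5 (the robust loss above).
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def fast_rcnn_loss(class_logits, bbox_deltas, labels, bbox_targets, lam=1.0):
    """Multi-task loss over a batch of N RoIs.

    class_logits: (N, K+1) raw class scores, bbox_deltas: (N, K*4) offsets,
    labels: (N,) true classes u (0 = background), bbox_targets: (N, 4) = v.
    """
    # L_cls = -log p_u, i.e. softmax followed by negative log-likelihood.
    cls_loss = F.cross_entropy(class_logits, labels)

    # The indicator [u >= 1]: the box loss counts only for foreground RoIs.
    fg = labels > 0
    if fg.any():
        # Pick the offsets t^u predicted for each RoI's true class u
        # (row u - 1, since only the K foreground classes have box outputs).
        deltas = bbox_deltas.view(bbox_deltas.size(0), -1, 4)
        t_u = deltas[fg, labels[fg] - 1]
        box_loss = smooth_l1(t_u - bbox_targets[fg]).sum(dim=1).mean()
    else:
        box_loss = bbox_deltas.sum() * 0.0   # no foreground RoIs in this batch
    return cls_loss + lam * box_loss
```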

4. Speed Bottleneck

  • Fast R-CNN is much faster than R-CNN in both training and test time.

  • However, the improvement is not dramatic, because the region proposals are still generated separately by another model (selective search), and that step is very expensive.
