R-CNN

Region-CNN (R-CNN) is a CNN-based deep learning approach to object detection.

Difficulties in object detection: (1) objects have different aspect ratios and sizes; (2) different image sizes also affect the window size. ==> It is extremely slow to run CNN image classification at every location of an image.

0. Idea of R-CNN:

  • R-CNN uses selective search to generate about 2K region proposals per image, i.e. candidate bounding boxes for image classification.

  • For each bounding box, run a CNN to perform image classification.

  • Each bounding box can be refined using regression.

R-CNN flowchart

1. Model Workflow

How R-CNN works:

  1. Pre-train a CNN on an image classification task; the classification task involves N classes.

  2. Propose category-independent regions of interest by selective search (~2K candidates per image).

  3. Warp region candidates to the fixed size required by the CNN.

  4. Continue fine-tuning the CNN on warped proposal regions for K+1 classes.

    1. The one additional class refers to the background (no object of interest).

    2. The learning rate should be small at this stage.

    3. Each mini-batch oversamples the positive cases, since most proposed regions are just background.

  5. For each candidate image region, one forward propagation through the CNN generates a feature vector.

    1. Then, train one binary SVM for each class independently.

    2. The positive samples are proposed regions with an IoU overlap >= 0.3; negative samples are the irrelevant rest.

  6. To reduce localization errors, a regression model is trained on the CNN features to correct the predicted detection window with a bounding-box correction offset.
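
Putting steps 2, 3, and 5 together, here is a minimal inference-time sketch in Python. All helper callables (`proposals_fn`, `warp_fn`, `cnn.extract_features`, the per-class SVMs) are injected and hypothetical, not from a specific library:

```python
def rcnn_detect(image, proposals_fn, warp_fn, cnn, svms, score_thresh=0.0):
    """One-image R-CNN inference sketch (steps 2, 3, 5).

    proposals_fn(image) -> candidate boxes; warp_fn(image, box) -> fixed-size
    patch; cnn.extract_features(patch) -> 1-D feature vector; svms: dict of
    class -> sklearn-style binary SVM. Bounding box correction (step 6) and
    NMS are covered in later sections.
    """
    detections = []
    for box in proposals_fn(image):            # step 2: ~2K region proposals
        patch = warp_fn(image, box)            # step 3: warp to CNN input size
        feat = cnn.extract_features(patch)     # step 5: one forward pass
        for cls, svm in svms.items():          # step 5.1: per-class SVM score
            score = svm.decision_function(feat[None])[0]
            if score > score_thresh:
                detections.append((cls, box, score))
    return detections                          # refine and apply NMS afterwards
```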

2. Selective Search

How selective search works:

    • At the initialization stage, apply Felzenszwalb’s graph-based image segmentation algorithm to create the regions to start with.

    • Use a greedy algorithm to iteratively group regions together:

      • Calculate the similarities between all neighboring regions

      • Group the two most similar regions, and calculate new similarities between the merged region and its neighbors

    • Repeat the grouping step until the whole image becomes a single region
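
A minimal sketch of this greedy grouping loop in Python; `merge` and `sim` are hypothetical helpers that combine two region descriptors and score two regions (the similarity function is detailed later in this section):

```python
def hierarchical_grouping(regions, sims, merge, sim):
    """Greedy region grouping for selective search.

    regions: dict id -> region descriptor (format is illustrative);
    sims: dict frozenset({i, j}) -> similarity of neighboring regions.
    Every region ever formed is kept as a candidate proposal.
    """
    proposals = dict(regions)
    next_id = max(regions) + 1
    while sims:
        a, b = max(sims, key=sims.get)              # most similar pair
        new = merge(proposals[a], proposals[b])     # group the two regions
        # Neighbors of the merged region are the old neighbors of a and b.
        nbrs = {n for p in sims if a in p or b in p for n in p} - {a, b}
        # Drop all similarities involving a or b ...
        sims = {p: s for p, s in sims.items() if not (p & {a, b})}
        proposals[next_id] = new
        # ... and compute similarities between the new region and its neighbors.
        for n in nbrs:
            sims[frozenset((next_id, n))] = sim(new, proposals[n])
        next_id += 1
    return proposals   # from small initial regions up to the whole image
```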

  • Felzenszwalb’s Algorithm (Image Segmentation)

    • We use an undirected graph $G = (V, E)$ to represent an input image:

      • Vertex $v_i \in V$ represents one pixel.

      • Edge $e = (v_i, v_j) \in E$ connects two vertices $v_i$ and $v_j$.

      • Weight $w(v_i, v_j)$ measures the dissimilarity between $v_i$ and $v_j$. The higher the weight, the less similar the two vertices are.

      • Segmentation $S$ is a partition of $V$ into multiple connected components, $\{C\}$.

    • Key concepts for a good graph partition (image segmentation):

      • Internal difference: $Int(C) = \max_{e \in MST(C, E)} w(e)$, where $MST(C, E)$ is the minimum spanning tree of the component. A component $C$ remains connected even after we remove all edges with weight $< Int(C)$.

      • Difference between two components: $Diff(C_1, C_2) = \min_{v_i \in C_1, v_j \in C_2, (v_i, v_j) \in E} w(v_i, v_j)$. $Diff(C_1, C_2) = \infty$ if there is no edge in between.

      • Minimum internal difference: $MinInt(C_1, C_2) = \min(Int(C_1) + \tau(C_1), Int(C_2) + \tau(C_2))$,

        • where $\tau(C) = k/|C|$,

        ==> ensures a meaningful threshold for the difference between components ($Int(C)$ is close to zero for a tiny component)

      ==> A higher $k$ --> larger components

      • The quality of a segmentation is assessed by a pairwise region comparison of $C_1$ and $C_2$:

        • $D(C_1, C_2) = \begin{cases} \text{True} & \text{if } Diff(C_1, C_2) > MinInt(C_1, C_2) \\ \text{False} & \text{otherwise} \end{cases}$

        • Only if the predicate holds true do we consider them two independent components; otherwise they should be merged.

    • How Felzenszwalb’s Algorithm works, given $G = (V, E)$ with $|V| = n$ and $|E| = m$:

      • Edges are sorted by weight in ascending order, labeled as $e_1, e_2, \ldots, e_m$.

      • Each pixel starts in its own component, so we begin with $n$ components.

        • Repeat for $k = 1, \ldots, m$:

          • The segmentation snapshot at step $k$ is denoted as $S^k$.

          • We take the $k$-th edge in the order, $e_k = (v_i, v_j)$.

          • If $v_i$ and $v_j$ belong to different components $C_i^{k-1}$ and $C_j^{k-1}$ in the segmentation $S^{k-1}$, we merge them into one if $w(v_i, v_j) \leq MinInt(C_i^{k-1}, C_j^{k-1})$; otherwise do nothing.
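
A minimal sketch of this procedure in Python using a union-find structure. Building the pixel graph (e.g., a 4-connected grid with color-difference weights) is left out; the function only needs the edge list:

```python
class UnionFind:
    """Tracks components, their sizes, and their internal differences Int(C)."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n        # Int(C): max MST edge weight so far

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, ra, rb, w):
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        # Edges arrive in ascending order, so w is the largest MST edge
        # weight of the merged component, i.e. its new Int(C).
        self.internal[ra] = w

def felzenszwalb(edges, n, k=300.0):
    """edges: iterable of (weight, i, j) for neighboring pixels; n: #pixels."""
    uf = UnionFind(n)
    tau = lambda c: k / uf.size[c]       # tau(C) = k / |C|
    for w, i, j in sorted(edges):        # ascending weight order
        ci, cj = uf.find(i), uf.find(j)
        if ci != cj and w <= min(uf.internal[ci] + tau(ci),
                                 uf.internal[cj] + tau(cj)):
            uf.union(ci, cj, w)          # w <= MinInt: merge the components
    return uf                            # component of pixel p: uf.find(p)
```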

  • Similarity between neighboring regions

    • Goals:

      • Use multiple grouping criteria

      • Lead to a balanced hierarchy of small to large objects

      • Be efficient to compute: quickly combine measurements in two regions

    • For two regions $(r_i, r_j)$, selective search combines four similarities:

      $$s(r_i, r_j) = a_1 s_{color}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j)$$

      • Color:

        • Create a color histogram C for each channel in region r.

        • $s_{color}(r_i, r_j) = \sum^n_{k=1} \min(c^k_i, c^k_j)$

      • Texture:

        • Extract Gaussian derivatives of the image in 8 directions for each channel

        • Construct a 10-bin histogram for each, resulting in a 240-dim descriptor

      • Size:

        • We want small regions to merge into larger ones, to create a balanced hierarchy

        • Add a size component to ensure small regions are more similar to each other

          • $s_{size}(r_i, r_j) = 1 - \frac{size(r_i) + size(r_j)}{size(im)}$

      • Fill (shape compatibility):

        • Measure how well two regions "fit together", where $BB_{ij}$ is the tight bounding box around $r_i$ and $r_j$:

        • $s_{fill}(r_i, r_j) = 1 - \frac{size(BB_{ij}) - size(r_i) - size(r_j)}{size(im)}$
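
A minimal sketch of the combined similarity in NumPy, following the four formulas above. The region representation (L1-normalized histograms, pixel counts, corner-format bounding boxes) is assumed for illustration, not prescribed by the method:

```python
import numpy as np

def hist_intersection(h1, h2):
    """Sum of bin-wise minima of two L1-normalized histograms."""
    return float(np.minimum(h1, h2).sum())

def similarity(ri, rj, im_size, a=(1.0, 1.0, 1.0, 1.0)):
    """s(r_i, r_j), with regions as dicts holding 'color_hist',
    'texture_hist', 'size' (pixel count), and 'bbox' (x1, y1, x2, y2);
    these field names are illustrative."""
    s_color = hist_intersection(ri['color_hist'], rj['color_hist'])
    s_texture = hist_intersection(ri['texture_hist'], rj['texture_hist'])
    s_size = 1.0 - (ri['size'] + rj['size']) / im_size
    # Tight bounding box BB_ij around both regions:
    x1 = min(ri['bbox'][0], rj['bbox'][0])
    y1 = min(ri['bbox'][1], rj['bbox'][1])
    x2 = max(ri['bbox'][2], rj['bbox'][2])
    y2 = max(ri['bbox'][3], rj['bbox'][3])
    bb_size = (x2 - x1) * (y2 - y1)
    s_fill = 1.0 - (bb_size - ri['size'] - rj['size']) / im_size
    return a[0]*s_color + a[1]*s_texture + a[2]*s_size + a[3]*s_fill
```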

3. Bounding Box Regression

The R-CNN model predicts a bounding box with coordinates $p = (p_x, p_y, p_w, p_h)$ (center coordinates, width, height). The ground truth box coordinates are $g = (g_x, g_y, g_w, g_h)$. The transformation functions are:

$$\begin{aligned} \hat{g}_x &= p_w d_x(p) + p_x \\ \hat{g}_y &= p_h d_y(p) + p_y \\ \hat{g}_w &= p_w e^{d_w(p)} \\ \hat{g}_h &= p_h e^{d_h(p)} \end{aligned}$$

Therefore, the bounding box correction functions $d_i(p)$, where $i \in \{x, y, w, h\}$, are trained to regress the targets below:

$$\begin{aligned} t_x &= (g_x - p_x)/p_w \\ t_y &= (g_y - p_y)/p_h \\ t_w &= \log(g_w/p_w) \\ t_h &= \log(g_h/p_h) \end{aligned}$$

Furthermore, we can solve this problem by minimizing the SSE loss with regularization:

$$L = \sum_{i \in \{x, y, w, h\}} (t_i - d_i(p))^2 + \lambda \|w\|^2$$

Note: since not all predicted bounding boxes have corresponding ground truth boxes, only a predicted box with a nearby ground truth box of at least 0.6 IoU is kept for training the bbox regression model.
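
A minimal NumPy sketch of the target computation and the inverse transform, directly following the equations above (boxes given as center-x, center-y, width, height):

```python
import numpy as np

def regression_targets(p, g):
    """Targets t = (t_x, t_y, t_w, t_h) from proposal p and ground truth g."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return np.array([(gx - px) / pw,
                     (gy - py) / ph,
                     np.log(gw / pw),
                     np.log(gh / ph)])

def apply_correction(p, d):
    """Invert the transform: map predicted offsets d back to box coordinates."""
    px, py, pw, ph = p
    dx, dy, dw, dh = d
    return np.array([pw * dx + px,        # g_x hat
                     ph * dy + py,        # g_y hat
                     pw * np.exp(dw),     # g_w hat
                     ph * np.exp(dh)])    # g_h hat
```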

4. Non-Max Suppression

Dealing with multiple bounding boxes for the same object:

  • With a set of matched bounding boxes for the same object:

    • Sort all the bounding boxes by confidence score, and discard boxes with low confidence scores

    • With the remaining boxes,

      • Greedily select the one with the highest score

      • Skip any remaining box that has a high IoU with a previously selected one.
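
A minimal NMS sketch in NumPy, assuming boxes in (x1, y1, x2, y2) format with per-box confidence scores:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5, score_thresh=0.0):
    """boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    Returns indices of the boxes to keep."""
    idxs = np.where(scores > score_thresh)[0]        # drop low-confidence boxes
    order = idxs[np.argsort(scores[idxs])[::-1]]     # sort by score, descending
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    keep = []
    while order.size > 0:
        i = order[0]                                 # greedily take the best box
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box:
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]              # skip high-IoU duplicates
    return keep
```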

5. Hard Negative Mining

Not all negative examples are equally hard to identify.

  • If a region is just pure empty background, it is an "easy negative".

  • If a region contains a partial object, it could be a "hard negative".

==> We can collect these false positive samples during the training loop and include them in the training data to improve performance.
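
A minimal sketch of one mining round for a per-class SVM (sklearn-style `fit`/`predict`; the helper and variable names are illustrative):

```python
import numpy as np

def mine_hard_negatives(svm, X_train, y_train, background_feats):
    """One round of hard negative mining for a binary per-class SVM.

    background_feats: CNN features of regions known to contain no object.
    Regions the current SVM wrongly scores as positive are hard negatives;
    add them to the training set as negatives (label 0) and refit.
    """
    preds = svm.predict(background_feats)
    hard = background_feats[preds == 1]              # false positives
    X_train = np.vstack([X_train, hard])
    y_train = np.concatenate([y_train, np.zeros(len(hard))])
    svm.fit(X_train, y_train)
    return svm, X_train, y_train
```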

6. Speed Bottleneck

  • Running selective search to propose 2000 region candidates for every image;

  • Generating the CNN feature vector for every image region (N images * 2000).

  • The whole process involves three separate models with little shared computation: the convolutional neural network for image classification and feature extraction; the top SVM classifier for identifying target objects; and the regression model for tightening region bounding boxes.
