R-CNN

Region-CNN (R-CNN) is a CNN-based deep learning approach to object detection.

Difficulties in object detection: (1) objects have different aspect ratios and sizes; (2) different image sizes also affect the window size. ==> It is extremely slow to run CNN image classification at every location of an image.

0. Idea of R-CNN:

  • R-CNN uses selective search to generate about 2K region proposals per image, i.e. candidate bounding boxes for image classification.

  • For each bounding box, run a CNN to perform image classification.

  • Each bounding box can be refined using regression.

R-CNN flowchart

1. Model Workflow

How R-CNN works:

  1. Pre-train a CNN on an image classification task; the classification task involves N classes.

  2. Propose category-independent regions of interest by selective search (~2K candidates per image).

  3. Warp region candidates to the fixed size required by the CNN.

  4. Continue fine-tuning the CNN on warped proposal regions for K+1 classes.

    1. The one additional class refers to the background (no object of interest).

    2. The learning rate should be small at this stage.

    3. Each mini-batch oversamples the positive cases, since most proposed regions are just background.

  5. For each candidate image region, one forward propagation through the CNN generates a feature vector.

    1. Then, train one binary SVM for each class independently.

    2. The positive samples are proposed regions with an IoU overlap >= 0.3; negative samples are the irrelevant rest.

  6. To reduce localization errors, a regression model is trained on the CNN features to correct the predicted detection window with a bounding-box correction offset.
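
Putting steps 2, 3, and 5 together, here is a minimal inference-time sketch in Python. All helper callables (`proposals_fn`, `warp_fn`, `cnn.extract_features`, the per-class SVMs) are injected and hypothetical, not from a specific library:

```python
def rcnn_detect(image, proposals_fn, warp_fn, cnn, svms, score_thresh=0.0):
    """One-image R-CNN inference sketch (steps 2, 3, 5).

    proposals_fn(image) -> candidate boxes; warp_fn(image, box) -> fixed-size
    patch; cnn.extract_features(patch) -> 1-D feature vector; svms: dict of
    class -> sklearn-style binary SVM. Bounding box correction (step 6) and
    NMS are covered in later sections.
    """
    detections = []
    for box in proposals_fn(image):            # step 2: ~2K region proposals
        patch = warp_fn(image, box)            # step 3: warp to CNN input size
        feat = cnn.extract_features(patch)     # step 5: one forward pass
        for cls, svm in svms.items():          # step 5.1: per-class SVM score
            score = svm.decision_function(feat[None])[0]
            if score > score_thresh:
                detections.append((cls, box, score))
    return detections                          # refine and apply NMS afterwards
```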

2. Selective Search

How selective search works:

    • At the initialization stage, apply Felzenszwalb’s graph-based image segmentation algorithm to create the regions to start with.

    • Use a greedy algorithm to iteratively group regions together:

      • Calculate the similarities between all neighboring regions

      • Group the two most similar regions, and calculate new similarities between the merged region and its neighbors

    • Repeat the grouping step until the whole image becomes a single region
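
A minimal sketch of this greedy grouping loop in Python; `merge` and `sim` are hypothetical helpers that combine two region descriptors and score two regions (the similarity function is detailed later in this section):

```python
def hierarchical_grouping(regions, sims, merge, sim):
    """Greedy region grouping for selective search.

    regions: dict id -> region descriptor (format is illustrative);
    sims: dict frozenset({i, j}) -> similarity of neighboring regions.
    Every region ever formed is kept as a candidate proposal.
    """
    proposals = dict(regions)
    next_id = max(regions) + 1
    while sims:
        a, b = max(sims, key=sims.get)              # most similar pair
        new = merge(proposals[a], proposals[b])     # group the two regions
        # Neighbors of the merged region are the old neighbors of a and b.
        nbrs = {n for p in sims if a in p or b in p for n in p} - {a, b}
        # Drop all similarities involving a or b ...
        sims = {p: s for p, s in sims.items() if not (p & {a, b})}
        proposals[next_id] = new
        # ... and compute similarities between the new region and its neighbors.
        for n in nbrs:
            sims[frozenset((next_id, n))] = sim(new, proposals[n])
        next_id += 1
    return proposals   # from small initial regions up to the whole image
```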

  • Felzenszwalb’s Algorithm (Image Segmentation)

    • We use an undirected graph $G = (V, E)$ to represent an input image:

      • Vertex $v_i \in V$ represents one pixel.

      • Edge $e = (v_i, v_j) \in E$ connects two vertices $v_i$ and $v_j$.

      • Weight $w(v_i, v_j)$ measures the dissimilarity between $v_i$ and $v_j$. The higher the weight, the less similar the two vertices are.

      • Segmentation $S$ is a partition of $V$ into multiple connected components, $\{C\}$.

    • Key concepts for a good graph partition (image segmentation):

      • Internal difference: $Int(C) = \max_{e \in MST(C, E)} w(e)$, where $MST(C, E)$ is the minimum spanning tree of the component. A component $C$ remains connected even after we remove all edges with weight $< Int(C)$.

      • Difference between two components: $Diff(C_1, C_2) = \min_{v_i \in C_1, v_j \in C_2, (v_i, v_j) \in E} w(v_i, v_j)$. $Diff(C_1, C_2) = \infty$ if there is no edge in between.

      • Minimum internal difference: $MinInt(C_1, C_2) = \min(Int(C_1) + \tau(C_1), Int(C_2) + \tau(C_2))$,

        • where $\tau(C) = k/|C|$,

        ==> ensures a meaningful threshold for the difference between components ($Int(C)$ is close to zero for a tiny component)

      ==> A higher $k$ --> larger components

      • The quality of a segmentation is assessed by a pairwise region comparison of $C_1$ and $C_2$:

        • $D(C_1, C_2) = \begin{cases} \text{True} & \text{if } Diff(C_1, C_2) > MinInt(C_1, C_2) \\ \text{False} & \text{otherwise} \end{cases}$

        • Only if the predicate holds true do we consider them two independent components; otherwise they should be merged.

    • How Felzenszwalb’s Algorithm works, given $G = (V, E)$ with $|V| = n$ and $|E| = m$:

      • Edges are sorted by weight in ascending order, labeled as $e_1, e_2, \ldots, e_m$.

      • Each pixel starts in its own component, so we begin with $n$ components.

        • Repeat for $k = 1, \ldots, m$:

          • The segmentation snapshot at step $k$ is denoted as $S^k$.

          • We take the $k$-th edge in the order, $e_k = (v_i, v_j)$.

          • If $v_i$ and $v_j$ belong to different components $C_i^{k-1}$ and $C_j^{k-1}$ in the segmentation $S^{k-1}$, we merge them into one if $w(v_i, v_j) \leq MinInt(C_i^{k-1}, C_j^{k-1})$; otherwise do nothing.
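
A minimal sketch of this procedure in Python using a union-find structure. Building the pixel graph (e.g., a 4-connected grid with color-difference weights) is left out; the function only needs the edge list:

```python
class UnionFind:
    """Tracks components, their sizes, and their internal differences Int(C)."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n        # Int(C): max MST edge weight so far

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, ra, rb, w):
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        # Edges arrive in ascending order, so w is the largest MST edge
        # weight of the merged component, i.e. its new Int(C).
        self.internal[ra] = w

def felzenszwalb(edges, n, k=300.0):
    """edges: iterable of (weight, i, j) for neighboring pixels; n: #pixels."""
    uf = UnionFind(n)
    tau = lambda c: k / uf.size[c]       # tau(C) = k / |C|
    for w, i, j in sorted(edges):        # ascending weight order
        ci, cj = uf.find(i), uf.find(j)
        if ci != cj and w <= min(uf.internal[ci] + tau(ci),
                                 uf.internal[cj] + tau(cj)):
            uf.union(ci, cj, w)          # w <= MinInt: merge the components
    return uf                            # component of pixel p: uf.find(p)
```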

  • Similarity between neighboring regions

    • Goals:

      • Use multiple grouping criteria

      • Lead to a balanced hierarchy of small to large objects

      • Be efficient to compute: quickly combine measurements in two regions

    • For two regions $(r_i, r_j)$, selective search combines four similarities:

      $$s(r_i, r_j) = a_1 s_{color}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j)$$

      • Color:

        • Create a color histogram C for each channel in region r.

        • $s_{color}(r_i, r_j) = \sum^n_{k=1} \min(c^k_i, c^k_j)$

      • Texture:

        • Extract Gaussian derivatives of the image in 8 directions for each channel

        • Construct a 10-bin histogram for each, resulting in a 240-dim descriptor

      • Size:

        • We want small regions to merge into larger ones, to create a balanced hierarchy

        • Add a size component to ensure small regions are more similar to each other

          • $s_{size}(r_i, r_j) = 1 - \frac{size(r_i) + size(r_j)}{size(im)}$

      • Fill (shape compatibility):

        • Measure how well two regions "fit together", where $BB_{ij}$ is the tight bounding box around $r_i$ and $r_j$:

        • $s_{fill}(r_i, r_j) = 1 - \frac{size(BB_{ij}) - size(r_i) - size(r_j)}{size(im)}$
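
A minimal sketch of the combined similarity in NumPy, following the four formulas above. The region representation (L1-normalized histograms, pixel counts, corner-format bounding boxes) is assumed for illustration, not prescribed by the method:

```python
import numpy as np

def hist_intersection(h1, h2):
    """Sum of bin-wise minima of two L1-normalized histograms."""
    return float(np.minimum(h1, h2).sum())

def similarity(ri, rj, im_size, a=(1.0, 1.0, 1.0, 1.0)):
    """s(r_i, r_j), with regions as dicts holding 'color_hist',
    'texture_hist', 'size' (pixel count), and 'bbox' (x1, y1, x2, y2);
    these field names are illustrative."""
    s_color = hist_intersection(ri['color_hist'], rj['color_hist'])
    s_texture = hist_intersection(ri['texture_hist'], rj['texture_hist'])
    s_size = 1.0 - (ri['size'] + rj['size']) / im_size
    # Tight bounding box BB_ij around both regions:
    x1 = min(ri['bbox'][0], rj['bbox'][0])
    y1 = min(ri['bbox'][1], rj['bbox'][1])
    x2 = max(ri['bbox'][2], rj['bbox'][2])
    y2 = max(ri['bbox'][3], rj['bbox'][3])
    bb_size = (x2 - x1) * (y2 - y1)
    s_fill = 1.0 - (bb_size - ri['size'] - rj['size']) / im_size
    return a[0]*s_color + a[1]*s_texture + a[2]*s_size + a[3]*s_fill
```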

3. Bounding Box Regression

The R-CNN model predicts a bounding box with coordinates $p = (p_x, p_y, p_w, p_h)$ (center coordinates, width, height). The ground truth box coordinates are $g = (g_x, g_y, g_w, g_h)$. The transformation functions are:

$$\begin{aligned} \hat{g}_x &= p_w d_x(p) + p_x \\ \hat{g}_y &= p_h d_y(p) + p_y \\ \hat{g}_w &= p_w e^{d_w(p)} \\ \hat{g}_h &= p_h e^{d_h(p)} \end{aligned}$$

Therefore, the bounding box correction functions $d_i(p)$, where $i \in \{x, y, w, h\}$, are trained to regress the targets below:

$$\begin{aligned} t_x &= (g_x - p_x)/p_w \\ t_y &= (g_y - p_y)/p_h \\ t_w &= \log(g_w/p_w) \\ t_h &= \log(g_h/p_h) \end{aligned}$$

Furthermore, we can solve this problem by minimizing the SSE loss with regularization:

$$L = \sum_{i \in \{x, y, w, h\}} (t_i - d_i(p))^2 + \lambda \|w\|^2$$

Note: since not all predicted bounding boxes have corresponding ground truth boxes, only a predicted box with a nearby ground truth box of at least 0.6 IoU is kept for training the bbox regression model.
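
A minimal NumPy sketch of the target computation and the inverse transform, directly following the equations above (boxes given as center-x, center-y, width, height):

```python
import numpy as np

def regression_targets(p, g):
    """Targets t = (t_x, t_y, t_w, t_h) from proposal p and ground truth g."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return np.array([(gx - px) / pw,
                     (gy - py) / ph,
                     np.log(gw / pw),
                     np.log(gh / ph)])

def apply_correction(p, d):
    """Invert the transform: map predicted offsets d back to box coordinates."""
    px, py, pw, ph = p
    dx, dy, dw, dh = d
    return np.array([pw * dx + px,        # g_x hat
                     ph * dy + py,        # g_y hat
                     pw * np.exp(dw),     # g_w hat
                     ph * np.exp(dh)])    # g_h hat
```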

4. Non-Max Suppression

Dealing with multiple bounding boxes for the same object:

  • With a set of matched bounding boxes for the same object:

    • Sort all the bounding boxes by confidence score, and discard boxes with low confidence scores

    • With the remaining boxes,

      • Greedily select the one with the highest score

      • Skip any remaining box that has a high IoU with a previously selected one.
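
A minimal NMS sketch in NumPy, assuming boxes in (x1, y1, x2, y2) format with per-box confidence scores:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5, score_thresh=0.0):
    """boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    Returns indices of the boxes to keep."""
    idxs = np.where(scores > score_thresh)[0]        # drop low-confidence boxes
    order = idxs[np.argsort(scores[idxs])[::-1]]     # sort by score, descending
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    keep = []
    while order.size > 0:
        i = order[0]                                 # greedily take the best box
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box:
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]              # skip high-IoU duplicates
    return keep
```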

5. Hard Negative Mining

Not all negative examples are equally hard to identify.

  • If a region is just pure empty background, it is an "easy negative".

  • If a region contains a partial object, it could be a "hard negative".

==> We can collect these false positive samples during the training loop and include them in the training data to improve performance.
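
A minimal sketch of one mining round for a per-class SVM (sklearn-style `fit`/`predict`; the helper and variable names are illustrative):

```python
import numpy as np

def mine_hard_negatives(svm, X_train, y_train, background_feats):
    """One round of hard negative mining for a binary per-class SVM.

    background_feats: CNN features of regions known to contain no object.
    Regions the current SVM wrongly scores as positive are hard negatives;
    add them to the training set as negatives (label 0) and refit.
    """
    preds = svm.predict(background_feats)
    hard = background_feats[preds == 1]              # false positives
    X_train = np.vstack([X_train, hard])
    y_train = np.concatenate([y_train, np.zeros(len(hard))])
    svm.fit(X_train, y_train)
    return svm, X_train, y_train
```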

6. Speed Bottleneck

  • Running selective search to propose 2000 region candidates for every image;

  • Generating the CNN feature vector for every image region (N images * 2000).

  • The whole process involves three separate models with little shared computation: the convolutional neural network for image classification and feature extraction; the top SVM classifier for identifying target objects; and the regression model for tightening region bounding boxes.
