R-CNN
Region-CNN (R-CNN) is a CNN-based deep learning approach to object detection.
Difficulties in object detection: (1) objects have different aspect ratios and sizes; (2) different image sizes also affect the window size. ==> It is extremely slow to run CNN image classification at each location
0. Idea of R-CNN:
R-CNN uses selective search to generate about 2K region proposals, i.e., bounding boxes, for image classification.
For each bounding box, a CNN performs image classification.
Each bounding box can be refined using regression.

1. Model Workflow

How R-CNN works:
Pre-train a CNN network on an image classification task; the classification task involves N classes.
Propose category-independent regions of interest by selective search (~2K candidates per image).
Warp the region candidates to the fixed size required by the CNN.
Continue fine-tuning the CNN on the warped proposal regions for K + 1 classes.
The additional class refers to the background (no object of interest).
The learning rate should be small in this stage.
Each mini-batch oversamples the positive cases, since most proposed regions are just background.
For each candidate image region, one forward propagation through the CNN generates a feature vector.
Then, train one binary SVM per class independently.
The positive samples are proposed regions with an IoU overlap >= 0.3; the negative samples are all other (irrelevant) regions.
To reduce localization errors, a regression model is trained to correct the predicted detection window with a bounding box correction offset, using the CNN features (see the sketch below).
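Below is a minimal end-to-end sketch of this workflow at inference time. It is an illustration under assumptions, not the reference implementation: `selective_search`, `crop`, `warp`, `non_max_suppression`, and the model objects are all hypothetical stand-ins.

```python
# Hypothetical R-CNN inference sketch; every helper name here is illustrative.
def rcnn_detect(image, cnn, svms, regressors, iou_thresh=0.5):
    detections = []
    for box in selective_search(image):                  # ~2K class-agnostic proposals
        patch = warp(crop(image, box), size=(227, 227))  # fixed CNN input size
        feat = cnn.extract_features(patch)               # one forward pass per region
        for cls, svm in svms.items():                    # one binary SVM per class
            score = svm.decision_function([feat])[0]
            if score > 0:                                # accepted by this class
                refined = regressors[cls].correct(feat, box)  # bbox offset correction
                detections.append((cls, refined, score))
    return non_max_suppression(detections, iou_thresh)   # see Section 4 below
```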
2. Selective Search
How Selective Search Works:
At the initialization stage, apply Felzenszwalb and Huttenlocher's graph-based image segmentation algorithm to create the regions to start with.
Use a greedy algorithm to iteratively group regions together:
Calculate the similarities between all neighboring regions
Group the two most similar regions, and calculate new similarities between the resulting merged region and its neighbors
Repeat the grouping step until the whole image becomes a single region (a minimal sketch of this loop follows)
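A sketch of this greedy loop, assuming `bbox`, `merge`, and `similarity` helpers (the similarity measure is defined later in this section); none of these names come from the paper's code:

```python
def hierarchical_grouping(regions, neighbors, similarity):
    # regions: {id: region data}; neighbors: set of frozenset({i, j}) pairs.
    # `bbox`, `merge`, and `similarity` are assumed helpers, not library calls.
    proposals = [bbox(r) for r in regions.values()]    # proposals at every scale
    sims = {pair: similarity(*(regions[i] for i in pair)) for pair in neighbors}
    next_id = max(regions) + 1
    while sims:
        i, j = max(sims, key=sims.get)                 # most similar neighbor pair
        regions[next_id] = merge(regions.pop(i), regions.pop(j))
        proposals.append(bbox(regions[next_id]))
        stale = [p for p in sims if i in p or j in p]  # pairs touching i or j
        touched = {k for p in stale for k in p} - {i, j}
        for p in stale:
            del sims[p]
        for k in touched:                              # recompute vs. merged region
            sims[frozenset((next_id, k))] = similarity(regions[next_id], regions[k])
        next_id += 1
    return proposals
```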
Felzenszwalb’s Algorithm (Image Segmentation)
We use an undirected graph G = (V, E) to represent an input image:
Each vertex $v_i \in V$ represents one pixel.
Each edge $(v_i, v_j) \in E$ connects two vertices $v_i$ and $v_j$.
Its weight $w(v_i, v_j)$ measures the dissimilarity between $v_i$ and $v_j$. The higher the weight, the less similar the two vertices are.
A segmentation $S$ is a partition of $V$ into multiple connected components $\{C\}$.
Key concepts for a good graph partition (image segmentation):
Internal difference: $Int(C) = \max_{e \in MST(C, E)} w(e)$, where $MST(C, E)$ is the minimum spanning tree of the component. A component $C$ still remains connected even when we have removed all the edges with weights $< Int(C)$.
Difference between two components: $Dif(C_1, C_2) = \min_{v_i \in C_1, v_j \in C_2, (v_i, v_j) \in E} w(v_i, v_j)$. $Dif(C_1, C_2) = \infty$ if there is no edge in between.
Minimum internal difference: $MInt(C_1, C_2) = \min(Int(C_1) + \tau(C_1), Int(C_2) + \tau(C_2))$,
where $\tau(C) = k / |C|$ and $k$ is a hyperparameter,
==> ensures a meaningful threshold for the difference between components
==> A higher $k$ --> a preference for larger components
The quality of a segmentation is assessed by a pairwise region comparison predicate over components $C_1$ and $C_2$: $D(C_1, C_2) = \text{True}$ if $Dif(C_1, C_2) > MInt(C_1, C_2)$, otherwise False.
Only if the predicate holds True do we consider them as two independent components; otherwise, they should be merged.
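A tiny numeric check of this predicate, with made-up toy values:

```python
def keep_separate(dif, int_c1, size_c1, int_c2, size_c2, k=300):
    # D(C1, C2): True means the two components stay independent.
    mint = min(int_c1 + k / size_c1, int_c2 + k / size_c2)
    return dif > mint

# Toy values: Int(C1)=5, |C1|=100, Int(C2)=8, |C2|=50, Dif(C1,C2)=12, k=300.
# MInt = min(5 + 3, 8 + 6) = 8, and 12 > 8, so the components are not merged.
print(keep_separate(12, 5, 100, 8, 50))  # True
```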
How Felzenszwalb’s Algorithm works, given $G = (V, E)$ with $|V| = n$ and $|E| = m$:
Edges are sorted by weight in ascending order, labeled as $e_1, e_2, \dots, e_m$.
Initially, each pixel is its own component, so we start with $n$ components.
Repeat for $k = 1, \dots, m$:
The segmentation snapshot at step $k$ is denoted as $S^k$.
We take the $k$-th edge in the order, $e_k = (v_i, v_j)$.
If $v_i$ and $v_j$ belong to different components $C_i^{k-1}$ and $C_j^{k-1}$ in the segmentation $S^{k-1}$, we merge them into one if $w(e_k) \le MInt(C_i^{k-1}, C_j^{k-1})$; otherwise, do nothing.
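A compact sketch of this loop using a union-find structure. It omits the Gaussian pre-smoothing and the minimum-component-size post-processing of the published implementation, and it assumes the pixel-graph edges are already built:

```python
class DisjointSet:
    # Union-find that also tracks each component's size and Int(C).
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.int_diff = [0.0] * n   # Int(C): largest MST edge weight so far

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b, w):       # a, b are roots; edge of weight w joins them
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        self.int_diff[a] = w        # valid because edges arrive in ascending order

def felzenszwalb_segment(n_pixels, edges, k=300.0):
    # edges: iterable of (w, i, j) over pixel indices; in practice the graph is
    # a pixel grid and w is a color difference (assumed here, not shown).
    ds = DisjointSet(n_pixels)
    for w, i, j in sorted(edges):   # ascending by weight
        a, b = ds.find(i), ds.find(j)
        if a != b:
            mint = min(ds.int_diff[a] + k / ds.size[a],
                       ds.int_diff[b] + k / ds.size[b])
            if w <= mint:           # predicate D fails ==> merge
                ds.union(a, b, w)
    return ds                       # roots of find() give the final components
```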
Similarity between neighboring regions
Goals:
Use multiple grouping criteria
Lead to a balanced hierarchy of small to large objects
Be efficient to compute: quickly combine measurements in two regions
For two regions $(r_i, r_j)$, selective search combines four similarities:
Color: $s_{color}(r_i, r_j) = \sum_{k=1}^{n} \min(c_i^k, c_j^k)$ (histogram intersection)
Create a color histogram $C = (c^1, \dots, c^n)$ for each channel in region $r$ (25 bins per channel, giving a 75-dim descriptor).
Texture: $s_{texture}(r_i, r_j) = \sum_{k=1}^{n} \min(t_i^k, t_j^k)$
Extract Gaussian derivatives of the image in 8 directions for each channel.
Construct a 10-bin histogram for each, resulting in a 240-dim descriptor (8 directions x 10 bins x 3 channels).
Size: $s_{size}(r_i, r_j) = 1 - \frac{size(r_i) + size(r_j)}{size(im)}$
We want small regions to merge into larger ones, to create a balanced hierarchy.
The size component ensures that small regions are more similar to each other.
Shape: $s_{fill}(r_i, r_j) = 1 - \frac{size(BB_{ij}) - size(r_i) - size(r_j)}{size(im)}$, where $BB_{ij}$ is the tight bounding box around $r_i$ and $r_j$.
Measures how well two regions "fit together": the smaller the unfilled area of their joint bounding box, the more coherent the merge.
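The four measures are combined as a weighted sum, $s(r_i, r_j) = a_1 s_{color} + a_2 s_{texture} + a_3 s_{size} + a_4 s_{fill}$. A sketch under assumptions: each region is a dict carrying precomputed L1-normalized histograms, a pixel count, and a corner-format bounding box (this layout is illustrative, not from the paper's code):

```python
import numpy as np

def enclosing_box(b1, b2):
    # Boxes as (x1, y1, x2, y2); tight box around both regions.
    return (min(b1[0], b2[0]), min(b1[1], b2[1]),
            max(b1[2], b2[2]), max(b1[3], b2[3]))

def box_area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def combined_similarity(ri, rj, im_size, weights=(1.0, 1.0, 1.0, 1.0)):
    s_color = np.minimum(ri["color"], rj["color"]).sum()      # histogram intersection
    s_texture = np.minimum(ri["texture"], rj["texture"]).sum()
    s_size = 1.0 - (ri["size"] + rj["size"]) / im_size
    bb = enclosing_box(ri["bbox"], rj["bbox"])
    s_fill = 1.0 - (box_area(bb) - ri["size"] - rj["size"]) / im_size
    a1, a2, a3, a4 = weights
    return a1 * s_color + a2 * s_texture + a3 * s_size + a4 * s_fill
```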
3. Bounding Box Regression
The R-CNN model predicts a bounding box coordinate $\mathbf{p} = (p_x, p_y, p_w, p_h)$ (center coordinates, width, height). The ground truth box coordinates are $\mathbf{g} = (g_x, g_y, g_w, g_h)$. The transformation functions are:
$\hat{g}_x = p_w d_x(\mathbf{p}) + p_x$
$\hat{g}_y = p_h d_y(\mathbf{p}) + p_y$
$\hat{g}_w = p_w \exp(d_w(\mathbf{p}))$
$\hat{g}_h = p_h \exp(d_h(\mathbf{p}))$
Therefore, the bounding box correction functions $d_i(\mathbf{p})$, where $i \in \{x, y, w, h\}$, can be trained to regress the targets below:
$t_x = (g_x - p_x) / p_w$
$t_y = (g_y - p_y) / p_h$
$t_w = \log(g_w / p_w)$
$t_h = \log(g_h / p_h)$
Furthermore, we can solve this problem by minimizing the SSE loss with regularization:
$\mathcal{L}_{reg} = \sum_{i \in \{x, y, w, h\}} (t_i - d_i(\mathbf{p}))^2 + \lambda \|\mathbf{w}\|^2$
Note: since not all the predicted bounding boxes have corresponding ground truth boxes, only a predicted box with a nearby ground truth box (at least 0.6 IoU) is kept for training the bbox regression model.
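A minimal sketch of the target/correction pair defined above; the round-trip assertion shows the two transforms are exact inverses (box values are made up):

```python
import numpy as np

def bbox_targets(p, g):
    # Regression targets (t_x, t_y, t_w, t_h); p and g are (x, y, w, h) arrays
    # with (x, y) the box center.
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def apply_correction(p, d):
    # Invert the transform: offsets d = (d_x, d_y, d_w, d_h) refine proposal p.
    px, py, pw, ph = p
    dx, dy, dw, dh = d
    return np.array([pw * dx + px, ph * dy + py,
                     pw * np.exp(dw), ph * np.exp(dh)])

# Round trip: applying the targets recovers the ground truth exactly.
p = np.array([50.0, 60.0, 20.0, 40.0])
g = np.array([55.0, 58.0, 24.0, 36.0])
assert np.allclose(apply_correction(p, bbox_targets(p, g)), g)
```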
4. Non-Max Suppression
Dealing with multiple bounding boxes for the same object:
With a set of matched bounding boxes for the same object:
Sort all the bounding boxes by confidence score, and discard boxes with low confidence scores
With the remaining boxes,
Greedily select the one with the highest score
Skip the remaining boxes that have a high IoU with previously selected ones.
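A standard greedy NMS implementation in NumPy that follows the steps above (applied per class; boxes in corner format):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    # boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]              # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                          # greedily accept the top box
        # IoU of the accepted box against all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]    # skip heavy overlaps
    return keep
```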
5. Hard Negative Mining
For negative examples, not all of them are equally hard to identify:
If a region is just pure empty background, it is an "easy negative".
If it contains a partial object, it could be a "hard negative".
==> We can find such false positive samples during the training loop and include them in the training data to improve performance, as sketched below.
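A sketch of one mining pass under these assumptions: `proposals`, `model.score`, and `iou` are hypothetical helpers, and images are referenced by id:

```python
def mine_hard_negatives(model, image_ids, gt_boxes, score_thresh=0.0):
    # Proposals the current model wrongly scores as positive (no ground-truth
    # overlap) become extra negatives for the next training round.
    hard_negatives = []
    for img in image_ids:
        for box in proposals(img):
            if model.score(img, box) > score_thresh:          # predicted positive
                if all(iou(box, gt) < 0.3 for gt in gt_boxes[img]):
                    hard_negatives.append((img, box))         # false positive
    return hard_negatives
```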
6. Speed Bottleneck
Running selective search to propose 2000 region candidates for every image;
Generating the CNN feature vector for every region of every image (N images * ~2000 regions).
The whole process involves three models separately without much shared computation: the convolutional neural network for image classification and feature extraction; the top SVM classifier for identifying target objects; and the regression model for tightening region bounding boxes.