ResNet

0. Idea:

ResNet addresses the problem that simply making a network deeper does not automatically give better results. In a plain network, a deeper model can perform worse because of the vanishing/exploding gradient problem. ResNet adds skip/shortcut connections to overcome this. In the worst case, the residual blocks can learn identity mappings that skip the extra layers, so the deep network behaves like a shallower one and at least maintains its performance.

1. Architecture (ResNet-34, 34-layer plain, VGG-19)

  • The three networks compared are

    • Top: 34-layer ResNet with Skip / Shortcut Connections: the plain network plus skip/shortcut connections.

    • Middle: 34-layer Plain Network: a deeper network built in the style of VGG-19 (stacked small convolutions), without shortcuts.

    • Bottom: 19-layer VGG-19

2. Motivation of ResNet

2.1. Problems of the plain network: vanishing/exploding gradients

In a plain network with no skip/shortcut connections, vanishing/exploding gradients appear as the network gets deeper. In backpropagation, the partial derivative of the error with respect to a weight in an early layer is a product of per-layer derivatives along the chain ==> computing the gradients of the front layers has the effect of multiplying n of these small / large numbers together (see the numeric sketch after this list).

  • Vanishing: multiplying n small numbers ==> the product goes to 0

  • Exploding: multiplying n large numbers ==> the product becomes extremely large
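
A minimal numeric sketch (not from the paper) of why these products matter; the depth n and the per-layer factors below are illustrative assumptions:

```python
n = 50  # hypothetical network depth

small_factor = 0.9   # per-layer derivative slightly below 1
large_factor = 1.1   # per-layer derivative slightly above 1

vanishing = small_factor ** n   # ~0.005 -> gradient signal almost gone
exploding = large_factor ** n   # ~117   -> gradient blows up

print(f"product of {n} factors of {small_factor}: {vanishing:.6f}")
print(f"product of {n} factors of {large_factor}: {exploding:.2f}")
```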

Solutions:

  • ResNet: Skip/ Shortcut connections

  • A smaller batch size

  • LSTM: use gated neuron structures (for recurrent networks)

  • Use gradient clipping: when the gradient norm exceeds a threshold, rescale (clip) it to that threshold value, e.g. 0.5 (see the sketch after this list)

  • Weight regularization: add an L1 or L2 penalty to help with exploding gradients
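
A minimal sketch of gradient clipping in PyTorch; the model, optimizer, loss, and the 0.5 threshold below are illustrative assumptions, not part of ResNet itself:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model and data, just to show where clipping goes.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(8, 16)
y = torch.randn(8, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Clip the global gradient norm to 0.5 before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
```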

2.2. Skip/ Shortcut connection in ResNet

Skip/ Shortcut connection

The output is H(x) = F(x) + x, so the weight layers learn a residual mapping: F(x) = H(x) - x

If the gradient through the weight layers vanishes ==> the identity term x can still carry the gradient back to earlier layers ==> the vanished gradient is added back
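
A minimal sketch of a residual (identity-shortcut) block in PyTorch; the channel count and layer choices are illustrative assumptions, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """H(x) = F(x) + x, where F is two 3x3 conv layers (illustrative)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(f + x)  # F(x) + x: the shortcut adds the input back

block = ResidualBlock(64)
out = block(torch.randn(1, 64, 56, 56))  # same shape in and out
```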

2.3 Two types of residual connections

  • Identity shortcut (x): when the input and output have the same dimensions ==> no extra parameters

    • H(x) = F(x) + x = F(x, \{W_i\}) + x

  • When the input/output dimensions change ==> extra parameters W_s are added (see the sketch after this list)

    • Perform the identity mapping with extra zero entries padded for the increased dimensions (no extra parameters)

    • Projection shortcut: match the dimensions with a 1x1 CONV layer

    • H(x) = F(x, \{W_i\}) + W_s x
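
A minimal sketch of the projection shortcut, assuming a stride-2 block that doubles the channel count (the specific sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleBlock(nn.Module):
    """Residual block whose shortcut is a 1x1 conv projection W_s x."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut: 1x1 conv matches both spatial size and channels.
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x, {W_i})
        return F.relu(f + self.proj(x))                            # F(x, {W_i}) + W_s x

block = DownsampleBlock(64, 128)
out = block(torch.randn(1, 64, 56, 56))  # -> (1, 128, 28, 28)
```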

3. Bottleneck design

Since the network is now very deep, the time complexity is high, so a bottleneck design is used to reduce it.

The Basic Block (Left) and The Proposed Bottleneck Design (Right)
  • How to add

    • 1x1 CONV layers are added to the start and end of the residual branch: the first 1x1 reduces the number of channels and the last 1x1 restores them

  • Why?

    • 1×1 CONV reduces the number of connections (parameters) while barely degrading the performance of the network.

  • Replacing each 2-layer basic block with a 3-layer bottleneck block turns ResNet-34 into ResNet-50 (a minimal bottleneck sketch follows):
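
A minimal sketch of the bottleneck block, assuming the common pattern of a 1x1 reduction, a 3x3 convolution, and a 1x1 restoration; the exact channel counts below are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, plus the identity shortcut."""

    def __init__(self, channels: int, bottleneck_width: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, bottleneck_width, 1, bias=False)   # reduce channels
        self.bn1 = nn.BatchNorm2d(bottleneck_width)
        self.conv2 = nn.Conv2d(bottleneck_width, bottleneck_width, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck_width)
        self.conv3 = nn.Conv2d(bottleneck_width, channels, 1, bias=False)   # restore channels
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = F.relu(self.bn1(self.conv1(x)))
        f = F.relu(self.bn2(self.conv2(f)))
        f = self.bn3(self.conv3(f))
        return F.relu(f + x)  # identity shortcut, since in/out channels match

block = Bottleneck(256, 64)   # 256 -> 64 -> 64 -> 256 channels
out = block(torch.randn(1, 256, 56, 56))
```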
