ResNet
0. Idea:
ResNet addresses the problem that simply making a network deeper does not automatically give better results. In a plain network, a deeper model can perform worse because of the vanishing/exploding gradient problem. ResNet adds skip/shortcut connections to overcome this. In the worst case, the residual blocks can simply learn identity mappings, so the deep network behaves like a shallower one and performance is maintained.
1. Architecture (ResNet-34, 34-layer plain, VGG-19)

The three networks are:
Top: 34-layer ResNet with skip/shortcut connections: the plain network with skip/shortcut connections added.
Middle: 34-layer plain network: treated as a deeper version of VGG-19.
Bottom: 19-layer VGG-19.
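To compare the published architectures concretely, here is a minimal sketch, assuming torchvision is installed (the 34-layer plain network has no off-the-shelf implementation), that instantiates ResNet-34 and VGG-19 and counts their parameters:

```python
# Instantiate the two published architectures and compare parameter counts.
import torchvision.models as models

def count_params(model):
    return sum(p.numel() for p in model.parameters())

resnet34 = models.resnet34()  # 34-layer ResNet with shortcut connections
vgg19 = models.vgg19()        # 19-layer VGG

print(f"ResNet-34 parameters: {count_params(resnet34) / 1e6:.1f}M")
print(f"VGG-19 parameters:    {count_params(vgg19) / 1e6:.1f}M")
```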
2. Motivation of ResNet
2.1. Problems of plain networks: vanishing/exploding gradients
In a plain network with no skip/shortcut connections, vanishing/exploding gradients appear as the network gets deeper. In backpropagation, the partial derivative of the error function with respect to an early-layer weight is, by the chain rule, a product of many per-layer factors ==> it has the effect of multiplying n of these small/large numbers to compute the gradients of the front layers.
Vanished: multiplying n small numbers ==> 0
Exploded: multiplying n large numbers ==> too large
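A minimal sketch of this effect, using an illustrative 50-layer plain MLP with sigmoid activations (not an architecture from the paper): the gradient norm at the first layer comes out many orders of magnitude smaller than at the last layer.

```python
# Stacking many layers multiplies many Jacobian factors, so the gradient
# reaching the first layer can become tiny. Depth/width are illustrative.
import torch
import torch.nn as nn

depth = 50
layers = []
for _ in range(depth):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]  # sigmoid saturates easily
plain = nn.Sequential(*layers)

x = torch.randn(8, 64)
loss = plain(x).pow(2).mean()
loss.backward()

first_grad = plain[0].weight.grad.norm().item()   # first Linear layer
last_grad = plain[-2].weight.grad.norm().item()   # last Linear layer
print(f"gradient norm at last layer:  {last_grad:.3e}")
print(f"gradient norm at first layer: {first_grad:.3e}")  # typically far smaller
```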
Solutions:
ResNet: skip/shortcut connections
A smaller batch size
LSTM: use gate-based neuron structures
Gradient clipping: when the gradient norm exceeds a threshold, rescale it to the clipping value, often 0.5 (a sketch is given after this list)
Weight regularization: add an L1 or L2 penalty to help with exploding gradients
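The gradient-clipping fix can be sketched in a few lines of PyTorch; the linear model, the random data, and the 0.5 threshold below are placeholder choices for illustration.

```python
# One training step with gradient clipping applied before the optimizer update.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their total norm does not exceed 0.5.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
```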
2.2. Skip/Shortcut connections in ResNet

The output is H(x) = F(x) + x, so the weight layers learn a residual mapping: F(x) = H(x) − x
If the gradient through the weight layers vanishes ==> the identity x can still be propagated back to earlier layers ==> the vanished gradient is carried back through the shortcut
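A minimal sketch of a residual block with an identity shortcut, assuming the input and output have the same number of channels; F(x) is two 3x3 convolution layers and the block returns F(x) + x.

```python
# Basic residual block: the weight layers learn F(x), the shortcut adds x.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))  # first weight layer
        out = self.bn2(self.conv2(out))        # second weight layer: F(x)
        return F.relu(out + x)                 # H(x) = F(x) + x
```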
2.3. Two types of residual connections
Identity shortcut (x): when the input and output have the same dimensions ==> no extra parameters
H(x) = F(x, {W_i}) + x
When the input/output dimensions change ==> the shortcut must also change the dimensions:
Perform identity mapping with extra zero entries padded for the increased dimensions (no extra parameters)
Projection shortcut to match the dimensions with a 1x1 CONV layer (adds extra parameters W_s)
H(x) = F(x, {W_i}) + W_s x
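A minimal sketch of the projection shortcut, assuming the block halves the spatial size and changes the channel count; the strided 1x1 convolution plays the role of W_s.

```python
# Residual block with a projection shortcut for dimension changes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Projection shortcut W_s: matches both channel count and spatial size.
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.proj(x))  # H(x) = F(x, {W_i}) + W_s x
```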
3. Bottleneck design
Since the network is now very deep, the time complexity is high. A bottleneck design is used to reduce the complexity.

How to add
1x1 CONV layers are added to the start and end of each residual block
Why?
1×1 CONV layers reduce the number of connections (parameters) while not degrading the performance of the network too much.
After replacing each 2-layer block with a 3-layer bottleneck block, ResNet-34 becomes ResNet-50:

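A minimal sketch of the bottleneck block, using the 256/64 channel widths from the paper's example; the parameter comparison at the end is illustrative arithmetic, not a measured count.

```python
# Bottleneck block: 1x1 conv shrinks the channels, 3x3 conv works on the
# reduced width, and a final 1x1 conv restores the original channel count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, 1, bias=False)              # 1x1: shrink
        self.conv = nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False)   # 3x3
        self.expand = nn.Conv2d(bottleneck, channels, 1, bias=False)              # 1x1: restore
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.expand(out))
        return F.relu(out + x)  # identity shortcut, same channel count in and out

# Rough weight-count comparison against two plain 3x3 convs at 256 channels:
plain = 2 * (3 * 3 * 256 * 256)                  # ~1.18M weights
bottle = 1*1*256*64 + 3*3*64*64 + 1*1*64*256     # ~0.07M weights
print(plain, bottle)
```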