ResNet

0. Idea:

ResNet addresses the problem that simply making a network deeper does not automatically give better results. In a plain network, a deeper model can perform worse because of the vanishing/exploding gradient problem. ResNet adds skip/shortcut connections to overcome this. In the worst case, the residual blocks can learn identity mappings that skip the extra layers, so the deep network behaves like a shallower one and at least maintains its performance.

1. Architecture (ResNet-34, 34-layer plain, VGG-19)

  • The three networks compared are

    • Top: 34-layer ResNet with Skip / Shortcut Connections: the plain network plus skip/shortcut connections.

    • Middle: 34-layer Plain Network: a deeper network built in the style of VGG-19 (stacked small convolutions), without shortcuts.

    • Bottom: 19-layer VGG-19

2. Motivation of ResNet

2.1. Problems of the plain network: vanishing/exploding gradients

In a plain network with no skip/shortcut connections, vanishing/exploding gradients appear as the network gets deeper. In backpropagation, the partial derivative of the error with respect to a weight in an early layer is a product of per-layer derivatives along the chain ==> computing the gradients of the front layers has the effect of multiplying n of these small / large numbers together (see the numeric sketch after this list).

  • Vanishing: multiplying n small numbers ==> the product goes to 0

  • Exploding: multiplying n large numbers ==> the product becomes extremely large
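
A minimal numeric sketch (not from the paper) of why these products matter; the depth n and the per-layer factors below are illustrative assumptions:

```python
n = 50  # hypothetical network depth

small_factor = 0.9   # per-layer derivative slightly below 1
large_factor = 1.1   # per-layer derivative slightly above 1

vanishing = small_factor ** n   # ~0.005 -> gradient signal almost gone
exploding = large_factor ** n   # ~117   -> gradient blows up

print(f"product of {n} factors of {small_factor}: {vanishing:.6f}")
print(f"product of {n} factors of {large_factor}: {exploding:.2f}")
```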

Solutions:

  • ResNet: Skip/ Shortcut connections

  • A smaller batch size

  • LSTM: use gated neuron structures (for recurrent networks)

  • Use gradient clipping: when the gradient norm exceeds a threshold, rescale (clip) it to that threshold value, e.g. 0.5 (see the sketch after this list)

  • Weight regularization: add an L1 or L2 penalty to help with exploding gradients
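
A minimal sketch of gradient clipping in PyTorch; the model, optimizer, loss, and the 0.5 threshold below are illustrative assumptions, not part of ResNet itself:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model and data, just to show where clipping goes.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(8, 16)
y = torch.randn(8, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Clip the global gradient norm to 0.5 before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
```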

2.2. Skip/ Shortcut connection in ResNet

Skip/ Shortcut connection

The output is H(x) = F(x) + x, so the weight layers learn a residual mapping: F(x) = H(x) - x

If the gradient through the weight layers vanishes ==> the identity term x can still carry the gradient back to earlier layers ==> the vanished gradient is added back
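
A minimal sketch of a residual (identity-shortcut) block in PyTorch; the channel count and layer choices are illustrative assumptions, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """H(x) = F(x) + x, where F is two 3x3 conv layers (illustrative)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(f + x)  # F(x) + x: the shortcut adds the input back

block = ResidualBlock(64)
out = block(torch.randn(1, 64, 56, 56))  # same shape in and out
```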

2.3 Two types of residual connections

  • Identity shortcut (x): when the input and output have the same dimensions ==> no extra parameters

    • H(x) = F(x) + x = F(x, \{W_i\}) + x

  • When the input/output dimensions change ==> extra parameters W_s are added (see the sketch after this list)

    • Perform the identity mapping with extra zero entries padded for the increased dimensions (no extra parameters)

    • Projection shortcut: match the dimensions with a 1x1 CONV layer

    • H(x) = F(x, \{W_i\}) + W_s x
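
A minimal sketch of the projection shortcut, assuming a stride-2 block that doubles the channel count (the specific sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleBlock(nn.Module):
    """Residual block whose shortcut is a 1x1 conv projection W_s x."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut: 1x1 conv matches both spatial size and channels.
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x, {W_i})
        return F.relu(f + self.proj(x))                            # F(x, {W_i}) + W_s x

block = DownsampleBlock(64, 128)
out = block(torch.randn(1, 64, 56, 56))  # -> (1, 128, 28, 28)
```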

3. Bottleneck design

Since the network is now very deep, the time complexity is high, so a bottleneck design is used to reduce it.

The Basic Block (Left) and The Proposed Bottleneck Design (Right)
  • How to add

    • 1x1 CONV layers are added to the start and end of the residual branch: the first 1x1 reduces the number of channels and the last 1x1 restores them

  • Why?

    • 1×1 CONV reduces the number of connections (parameters) while barely degrading the performance of the network.

  • Replacing each 2-layer basic block with a 3-layer bottleneck block turns ResNet-34 into ResNet-50 (a minimal bottleneck sketch follows):
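
A minimal sketch of the bottleneck block, assuming the common pattern of a 1x1 reduction, a 3x3 convolution, and a 1x1 restoration; the exact channel counts below are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, plus the identity shortcut."""

    def __init__(self, channels: int, bottleneck_width: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, bottleneck_width, 1, bias=False)   # reduce channels
        self.bn1 = nn.BatchNorm2d(bottleneck_width)
        self.conv2 = nn.Conv2d(bottleneck_width, bottleneck_width, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck_width)
        self.conv3 = nn.Conv2d(bottleneck_width, channels, 1, bias=False)   # restore channels
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = F.relu(self.bn1(self.conv1(x)))
        f = F.relu(self.bn2(self.conv2(f)))
        f = self.bn3(self.conv3(f))
        return F.relu(f + x)  # identity shortcut, since in/out channels match

block = Bottleneck(256, 64)   # 256 -> 64 -> 64 -> 256 channels
out = block(torch.randn(1, 256, 56, 56))
```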
