VGGNet

1. Architecture (VGG-16 and VGG-19)

Ablation study over the different network configurations
  • VGG-11 architecture (224x224x3)

    • Convolution layer with 64 filters of size 3x3

      • Max pooling layer with filter size 2x2

      => 112x112x64

    • Convolution layer with 128 filters of size 3x3

      • Max pooling layer with filter size 2x2

      => 56x56x128

    • Convolution layer with 256 filters of size 3x3

      => 56x56x256

    • Convolution layer with 256 filters of size 3x3

      • Max pooling layer with filter size 2x2

      => 28x28x256

    • Convolution layer with 512 filters of size 3x3

      => 28x28x512

    • Convolution layer with 512 filters of size 3x3

      • Max pooling layer with filter size 2x2

      => 14x14x512

    • Convolution layer with 512 filters of size 3x3

      => 14x14x512

    • Convolution layer with 512 filters of size 3x3

      • Max pooling layer with filter size 2x2

      => 7x7x512 = 25088

    • Fully connected layer: 4096 neurons

    • Fully connected layer: 4096 neurons

    • Fully connected layer: 1000 neurons (1000 classes)

      • Softmax

  • VGG-11 (LRN) is the variant with an additional local response normalization (LRN) operation, as introduced in AlexNet.

  • VGG-13 added two more 3x3 convolutional layers, which improves performance.

  • VGG-16 (Conv1) added three 1×1 conv layers, which help the classification accuracy.

    • The 1×1 conv increases the non-linearity of the decision function without changing the receptive fields.

    • A 1×1 conv is a linear projection onto a space of the same (high) dimensionality, followed by a non-linearity.

  • VGG-16 then replaced the three 1×1 conv layers with 3×3 conv layers.

  • VGG-19 added three more 3×3 conv layers, which also help the classification accuracy.
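The shape bookkeeping in the VGG-11 walkthrough above can be sanity-checked in a few lines. This is only a sketch: `trace_vgg11` is an illustrative helper (not from the paper), assuming 3x3 convs with stride 1 and padding 1 (spatial size unchanged) and 2x2 max pools with stride 2 (spatial size halved), as the architecture uses.

```python
def trace_vgg11(h=224, w=224):
    """Trace (H, W, C) shapes through the VGG-11 conv stack.
    Each block is a list of conv filter counts followed by one max pool."""
    blocks = [[64], [128], [256, 256], [512, 512], [512, 512]]
    shapes = []
    c = 3  # RGB input: 224x224x3
    for block in blocks:
        for filters in block:
            c = filters              # a 3x3 conv (pad 1) only changes channels
            shapes.append((h, w, c))
        h, w = h // 2, w // 2        # a 2x2 max pool halves H and W
        shapes.append((h, w, c))
    return shapes

shapes = trace_vgg11()
print(shapes[-1], shapes[-1][0] * shapes[-1][1] * shapes[-1][2])
# final feature map is 7x7x512, which flattens to 25088 inputs for the first FC layer
```

Running the trace reproduces every `=>` annotation above, ending at 7x7x512 = 25088.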

Concrete example of VGG-16

2. Technical details

2.1 Why use small filters: 3x3 kernel

  • A stack of small filters covers the same receptive field as a larger filter

    • 2 layers of 3x3 kernel covers 5x5 kernel

    • 3 layers of 3x3 kernel covers 7x7 kernel

    • 5 layers of 3x3 kernel covers 11x11 kernel

  • The number of parameters is smaller (counted per input-output channel pair, ignoring biases):

    • 1 layer of 11x11 kernels vs. 5 layers of 3x3 kernels

      • 1 layer of 11x11 kernels has 11x11 = 121 parameters

      • 5 layers of 3x3 kernels have 3x3x5 = 45 parameters

    • 1 layer of 7x7 kernels vs. 3 layers of 3x3 kernels

      • 1 layer of 7x7 kernels has 7x7 = 49 parameters

      • 3 layers of 3x3 kernels have 3x3x3 = 27 parameters

  • ==> Faster convergence and reduced overfitting
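The receptive-field and parameter arithmetic above can be checked directly. A short sketch, where `receptive_field` and `stack_params` are illustrative helpers and the counts are per input-output channel pair with biases ignored:

```python
def receptive_field(num_layers, k=3):
    """Receptive field of num_layers stacked k x k convs with stride 1:
    each extra layer widens the field by k - 1 pixels."""
    return 1 + num_layers * (k - 1)

def stack_params(num_layers, k, channels=1):
    """Weights in a stack of k x k conv layers, each with `channels`
    input and output channels (biases ignored)."""
    return num_layers * k * k * channels * channels

assert receptive_field(2) == 5    # two 3x3 layers cover 5x5
assert receptive_field(3) == 7    # three 3x3 layers cover 7x7
assert receptive_field(5) == 11   # five 3x3 layers cover 11x11

print(stack_params(1, 11), "vs", stack_params(5, 3))  # 121 vs 45
print(stack_params(1, 7), "vs", stack_params(3, 3))   # 49 vs 27
```

With realistic channel counts the saving persists, e.g. `stack_params(1, 7, 512)` is still nearly twice `stack_params(3, 3, 512)`.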

2.2 Why use 1x1 convolution layer

  • Problem of too many feature maps

    • Because the convolution must be performed through the full depth of the input, the cost grows with the number of feature maps, especially with relatively large kernels (5x5, 7x7)

  • Downsample with 1x1 filters

    • A 1×1 filter will only have a single parameter or weight for each channel in the input

    • It can be used for channel-wise pooling.

    • This simple technique can be used for dimensionality reduction, decreasing the number of feature maps whilst retaining their salient features.
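At each pixel, a 1×1 conv is just a C_out × C_in matrix applied to the channel vector, which is why it acts as channel-wise pooling. A minimal pure-Python sketch (`conv1x1` is a hypothetical helper for illustration):

```python
def conv1x1(feature_map, weights):
    """1x1 convolution: at every spatial position, project the C_in
    channel vector down to C_out values (a per-pixel linear map).
    feature_map: H x W x C_in nested lists; weights: C_out x C_in."""
    return [[[sum(w[c] * px[c] for c in range(len(px))) for w in weights]
             for px in row]
            for row in feature_map]

# Reduce a 2x2x4 feature map to 2x2x2: dimensionality reduction across channels.
fmap = [[[1.0, 2.0, 3.0, 4.0]] * 2] * 2
w = [[0.25, 0.25, 0.25, 0.25],   # filter 1: average of the 4 input channels
     [1.0, 0.0, 0.0, 0.0]]       # filter 2: pass channel 0 through unchanged
out = conv1x1(fmap, w)
print(out[0][0])  # [2.5, 1.0]
```

The spatial size (2x2) is untouched; only the channel dimension shrinks from 4 to 2, exactly the "retain salient features, drop feature maps" effect described above.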

2.3 Multi-scale training

  • Motivation

    • As objects appear at different scales within images, training the network at a single scale may miss objects at other scales.

  • For single-scale training, an image is rescaled so that its smaller side equals 256 or 384, then randomly cropped to 224x224.

  • For multi-scale training, an image is rescaled so that its smaller side is sampled from the range [256, 512], then randomly cropped to 224x224.
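The multi-scale (scale-jitter) regime above can be sketched as follows. This is a toy illustration: `train_scale_jitter` is a hypothetical helper that returns the resized image size and the crop box, not code from the paper.

```python
import random

def train_scale_jitter(img_w, img_h, s_min=256, s_max=512, crop=224):
    """Sample a training scale S in [s_min, s_max], resize the image so
    its smaller side equals S, then pick a random 224x224 crop box."""
    s = random.randint(s_min, s_max)          # per-image scale jitter
    scale = s / min(img_w, img_h)
    w, h = round(img_w * scale), round(img_h * scale)
    x = random.randint(0, w - crop)           # random crop position
    y = random.randint(0, h - crop)
    return (w, h), (x, y, x + crop, y + crop)

random.seed(0)
size, box = train_scale_jitter(640, 480)
print(size, box)  # resized dimensions and a 224x224 crop inside them
```

Single-scale training is the special case `s_min == s_max` (256 or 384).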

2.4 Multi-scale testing

  • Motivation

    • Multi-scale testing can also reduce the error rate, since we do not know the scale of the objects in a test image.

  • By scaling the test image to several sizes, we increase the chance of correct classification.
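A sketch of the test-time idea: run the classifier at several scales and average the class posteriors. Here `predict` is a stand-in callback for a trained network, and `toy_predict` is fabricated purely to illustrate the averaging; neither comes from the paper.

```python
def multi_scale_predict(image, scales, predict):
    """Average class probabilities over several test scales.
    `predict(image, scale)` returns a list of class probabilities."""
    probs = [predict(image, s) for s in scales]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]

# Toy predictor: class 1 is only recognized at the larger scales.
def toy_predict(image, scale):
    return [0.8, 0.2] if scale < 300 else [0.3, 0.7]

avg = multi_scale_predict(None, [256, 384, 512], toy_predict)
print(avg)  # averaging shifts the decision toward class 1
```

A single-scale test at 256 would have picked class 0; pooling the three scales flips the decision, which is the error-rate reduction argued for above.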

2.5 Convolutional testing: FC layer -> Conv layer in testing

VGG training structure
VGG testing structure
  • In VGGNet, the testing structure is different from the training structure:

    • The first FC layer is replaced by a 7x7 conv

    • The second and third FC layers are replaced by 1x1 convs

  • Why use Conv layer to replace FC layer:

    • The only difference between FC and CONV layers is that the neurons in a CONV layer are connected only to a local region of the input

    • For any CONV layer, there is an FC layer that implements the same forward function

    • Any FC layer can be converted to a CONV layer:

      • Set the filter size of the CONV to the spatial size of the input volume

      • Set the number of CONV filters to the number of neurons in the FC layer

      • For example, for an FC layer with K = 4096 neurons and an input volume of size 7x7x512,

      • We can replace it with a CONV layer that has 4096 filters of size 7x7

    • This conversion lets the network handle test images of different sizes.

    • Forwarding the converted CONV network a single time is much more efficient than iterating the original network over the different crop locations
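The output-size arithmetic behind this conversion can be sketched with a toy helper (`conv_output` is illustrative, using the standard formula (N + 2P - K) / S + 1):

```python
def conv_output(h, w, k, stride=1, pad=0):
    """Spatial output size of a convolution with kernel size k."""
    return ((h + 2 * pad - k) // stride + 1,
            (w + 2 * pad - k) // stride + 1)

# Training case: the 7x7x512 feature map feeds FC(4096).  The equivalent
# CONV layer (4096 filters of size 7x7) produces a 1x1x4096 output:
assert conv_output(7, 7, k=7) == (1, 1)

# Testing on a larger image: suppose the conv stack now yields a 9x9x512
# map.  The same converted 7x7 "FC" filter slides over it, giving a 3x3
# grid of class-score vectors in one forward pass:
print(conv_output(9, 9, k=7))  # (3, 3) -> a 3x3x4096 output
```

Each of the 3x3 positions corresponds to classifying one 224x224 crop of the larger image, which is why one pass over the converted network replaces many crop evaluations.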
