VGGNet
1. Architecture (VGG-16 and VGG-19)

VGG-11 architecture (224x224x3)
Convolution layer with 64 filters with size 3x3 kernels
Max pooling layer with filter size 2x2
=> 112x112x64
Convolution layer with 128 filters with size 3x3 kernels
Max pooling layer with filter size 2x2
=> 56x56x128
Convolution layer with 256 filters with size 3x3 kernels
=> 56x56x256
Convolution layer with 256 filters with size 3x3 kernels
Max pooling layer with filter size 2x2
=> 28x28x256
Convolution layer with 512 filters with size 3x3 kernels
=> 28x28x512
Convolution layer with 512 filters with size 3x3 kernels
Max pooling layer with filter size 2x2
=> 14x14x512
Convolution layer with 512 filters with size 3x3 kernels
=> 14x14x512
Convolution layer with 512 filters with size 3x3 kernels
Max pooling layer with filter size 2x2
=> 7x7x512 = 25088
Fully connected layer: 4096 neurons
Fully connected layer: 4096 neurons
Fully connected layer: 1000 neurons (1000 classes)
Softmax
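A minimal sketch of this VGG-11 layer stack in PyTorch (not the original trained model; layer grouping and names are illustrative, and the final softmax is assumed to be applied on top of the logits, e.g. inside the loss):

```python
import torch
import torch.nn as nn

class VGG11(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                       # 224 -> 112
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                       # 112 -> 56
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                       # 56 -> 28
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                       # 28 -> 14
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                       # 14 -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                             # 7*7*512 = 25088
            nn.Linear(7 * 7 * 512, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),             # logits for 1000 classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Quick shape check:
# VGG11()(torch.randn(1, 3, 224, 224)).shape  ->  torch.Size([1, 1000])
```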
VGG-11 (LRN) is the variant with an additional local response normalization (LRN) operation, as suggested by AlexNet.
VGG-13 adds two more 3x3 convolutional layers, which improves performance.
VGG-16 (Conv1) adds three 1×1 conv layers, which helps classification accuracy.
The 1×1 convs (each followed by a ReLU) increase the non-linearity of the decision function.
A 1×1 conv is a projection onto a space of the same (high) dimensionality, i.e. the number of channels is unchanged.
VGG-16 replaces the three 1×1 conv layers with 3×3 conv layers, which improves accuracy further.
VGG-19 adds another three 3×3 conv layers, which helps classification accuracy.
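These configurations can be summarized compactly as layer lists. Below is a sketch in the style of torchvision's internal configuration dictionary (the dictionary name and the "-1" suffix marking a 1×1 conv are my own notation, not from the paper or the library):

```python
# Numbers are output channels of a 3x3 conv, 'M' is a 2x2 max pool,
# and entries like '256-1' mark the 1x1 conv layers of configuration C.
vgg_cfgs = {
    "VGG-11 (A)": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "VGG-13 (B)": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "VGG-16 (C)": [64, 64, "M", 128, 128, "M", 256, 256, "256-1", "M",
                   512, 512, "512-1", "M", 512, 512, "512-1", "M"],   # three 1x1 convs
    "VGG-16 (D)": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                   512, 512, 512, "M", 512, 512, 512, "M"],           # 1x1 replaced by 3x3
    "VGG-19 (E)": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
                   512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}
```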

2. Technical details
2.1 Why use small filters: 3x3 kernel
A stack of small filters covers the same receptive field as a single larger filter:
2 layers of 3x3 kernels cover a 5x5 kernel
3 layers of 3x3 kernels cover a 7x7 kernel
5 layers of 3x3 kernels cover an 11x11 kernel
The number of parameters is smaller (counting a single channel):
1 layer of an 11x11 kernel vs. 5 layers of 3x3 kernels
1 layer of an 11x11 kernel has 11x11 = 121 parameters
5 layers of 3x3 kernels have 3x3x5 = 45 parameters
1 layer of a 7x7 kernel vs. 3 layers of 3x3 kernels
1 layer of a 7x7 kernel has 7x7 = 49 parameters
3 layers of 3x3 kernels have 3x3x3 = 27 parameters
==> Faster convergence and less overfitting (with more ReLU non-linearities per receptive field); see the quick check below.
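A quick numerical check of these counts, first per channel as in the text and then with channels included (the choice of C = 512 is just illustrative; biases are ignored):

```python
def stacked_params(kernel, layers, channels=1):
    # Each layer has kernel*kernel weights per (input channel, output channel) pair.
    return layers * kernel * kernel * channels * channels

for big, n_small in [(5, 2), (7, 3), (11, 5)]:
    single = stacked_params(big, 1)
    stack = stacked_params(3, n_small)
    print(f"{big}x{big}: {single} params  vs  {n_small} stacked 3x3: {stack} params")
# 5x5: 25 vs 18, 7x7: 49 vs 27, 11x11: 121 vs 45

# With C channels per layer the ratio is unchanged, e.g. for 7x7 and C = 512:
C = 512
print(stacked_params(7, 1, C), "vs", stacked_params(3, 3, C))   # 49*C*C vs 27*C*C
```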
2.2 Why use 1x1 convolution layer
Problem of too many feature maps
Because the convolution operation must be performed through the full depth of the input, a large number of feature maps is expensive, especially with relatively large kernels (5x5, 7x7)
Downsample the channel dimension with 1x1 filters
A 1×1 filter has only a single weight for each channel of the input
It can be used for channel-wise pooling.
This simple technique can be used for dimensionality reduction, decreasing the number of feature maps whilst retaining their salient features.
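A minimal sketch of this channel-wise reduction in PyTorch (the 512 → 64 projection is an arbitrary example, not a specific VGGNet layer):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 28, 28)            # 512 feature maps of size 28x28
reduce = nn.Conv2d(512, 64, kernel_size=1) # 1x1 conv: one weight per input channel per filter
y = reduce(x)
print(y.shape)                              # torch.Size([1, 64, 28, 28])
# Spatial size is unchanged; only the channel dimension is projected down,
# so a following 3x3/5x5 conv operates over far fewer input channels.
```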
2.3 Multi-scale training
Motivation
Since objects appear at different scales within an image, training the network at a single scale may miss objects at other scales.
For single-scale training, an image is rescaled so that its shorter side equals 256 or 384; the rescaled image is then cropped to 224x224.
For multi-scale training, an image is rescaled so that its shorter side is drawn from the range [256, 512], then cropped to 224x224 (see the sketch below).
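A sketch of such a training-time preprocessing pipeline using torchvision transforms (the scale range follows the text; the remaining augmentation details are assumptions):

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

class MultiScaleTrainTransform:
    """Rescale the shorter side to a random S in [s_min, s_max], then take a random 224x224 crop."""
    def __init__(self, s_min=256, s_max=512, crop=224):
        self.s_min, self.s_max, self.crop = s_min, s_max, crop

    def __call__(self, img):
        s = random.randint(self.s_min, self.s_max)  # a new S is drawn for every image
        img = TF.resize(img, s)                     # isotropic rescale, shorter side = S
        img = T.RandomCrop(self.crop)(img)          # random 224x224 crop
        img = T.RandomHorizontalFlip()(img)         # flip augmentation
        return TF.to_tensor(img)

# Single-scale training corresponds to s_min == s_max (e.g. 256 or 384).
```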
2.4 Multi-scale testing
Motivation
Multi-scale testing can also reduce the error rate, since we do not know the size of the object in the test image.
By scaling the test image to several sizes, we increase the chance of correct classification (a sketch follows).
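One simple way to realize this is to average the class probabilities over a few test scales. A sketch, assuming a `model` that maps a 224x224 crop to class logits (the scale set and the crop-based evaluation are illustrative, not the paper's exact dense-evaluation protocol):

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def multiscale_predict(model, img, scales=(256, 384, 512), crop=224):
    """Average softmax outputs over several rescaled versions of the test image."""
    model.eval()
    probs = []
    for s in scales:
        x = TF.center_crop(TF.resize(img, s), crop)  # rescale, then take the center 224x224 crop
        x = TF.to_tensor(x).unsqueeze(0)             # add a batch dimension
        with torch.no_grad():
            probs.append(F.softmax(model(x), dim=1))
    return torch.stack(probs).mean(dim=0)            # averaged class probabilities
```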
2.5 Convolutional testing: FC layer -> Conv layer in testing


In VGGNet, the testing structure is different from the training structure:
The first FC layer is replaced by a 7x7 conv layer
The second and third FC layers are replaced by 1x1 conv layers
Why use Conv layer to replace FC layer:
The only difference between FC and CONV layers is that the neurons in a CONV layer are connected only to a local region of the input
For any CONV layer, there is an FC layer that implements the same forward function
Any FC layer can be converted to a CONV layer:
Set the CONV filter size to the size of the input volume
Set the number of CONV filters to the number of neurons in the FC layer
For example, for an FC layer with K = 4096 neurons and an input volume of size 7x7x512,
we can replace it with a CONV layer of 4096 filters of size 7x7
This conversion lets the network process images of different sizes.
Forwarding the converted (fully convolutional) network a single time over a larger image is much more efficient than evaluating the original network repeatedly at different crop locations; the sketch below shows the weight conversion.
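A minimal sketch of converting the first FC layer (25088 → 4096) into an equivalent 7x7 convolution by reshaping its weight matrix (PyTorch; the layer variables are illustrative):

```python
import torch
import torch.nn as nn

fc = nn.Linear(7 * 7 * 512, 4096)             # trained FC layer
conv = nn.Conv2d(512, 4096, kernel_size=7)    # 4096 filters of size 7x7x512
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)  # reshape FC weights into filters
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)                 # pool5 output for a 224x224 input
out_fc = fc(x.flatten(1))                     # shape (1, 4096)
out_conv = conv(x).flatten(1)                 # shape (1, 4096), same values
print(torch.allclose(out_fc, out_conv, atol=1e-5))  # True

# On a larger input (e.g. a 14x14x512 pool5 map from a 448x448 image),
# the conv version produces a spatial map of scores in one forward pass.
```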