VGGNet
1. Architecture (VGG-16 and VGG-19)

VGG-11 architecture (224x224x3)
Convolution layer with 64 filters with size 3x3 kernels
Max pooling layer with filter size 2x2
=> 112x112x64
Convolution layer with 128 filters with size 3x3 kernels
Max pooling layer with filter size 2x2
=> 56x56x128
Convolution layer with 256 filters with size 3x3 kernels
=> 56x56x256
Convolution layer with 256 filters with size 3x3 kernels
Max pooling layer with filter size 2x2
=> 28x28x256
Convolution layer with 512 filters with size 3x3 kernels
=> 28x28x512
Convolution layer with 512 filters with size 3x3 kernels
Max pooling layer with filter size 2x2
=> 14x14x512
Convolution layer with 512 filters with size 3x3 kernels
=> 14x14x512
Convolution layer with 512 filters with size 3x3 kernels
Max pooling layer with filter size 2x2
=> 7x7x512 = 25088
Fully connected layer: 4096 neurons
Fully connected layer: 4096 neurons
Fully connected layer: 1000 neurons (1000 classes)
Softmax
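A minimal sketch of this VGG-11 layer stack in PyTorch (not the original trained model; layer grouping and names are illustrative, and the final softmax is assumed to be applied on top of the logits, e.g. inside the loss):

```python
import torch
import torch.nn as nn

class VGG11(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                       # 224 -> 112
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                       # 112 -> 56
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                       # 56 -> 28
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                       # 28 -> 14
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                       # 14 -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                             # 7*7*512 = 25088
            nn.Linear(7 * 7 * 512, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),             # logits for 1000 classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Quick shape check:
# VGG11()(torch.randn(1, 3, 224, 224)).shape  ->  torch.Size([1, 1000])
```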
VGG-11 (LRN) is the variant with an additional local response normalization (LRN) operation, as suggested by AlexNet.
VGG-13 adds two more 3x3 convolutional layers, which improves performance.
VGG-16 (Conv1) adds three 1×1 conv layers, which helps classification accuracy.
The 1×1 convs (each followed by a ReLU) increase the non-linearity of the decision function.
A 1×1 conv is a projection onto a space of the same (high) dimensionality, i.e. the number of channels is unchanged.
VGG-16 replaces the three 1×1 conv layers with 3×3 conv layers, which improves accuracy further.
VGG-19 adds another three 3×3 conv layers, which helps classification accuracy.
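These configurations can be summarized compactly as layer lists. Below is a sketch in the style of torchvision's internal configuration dictionary (the dictionary name and the "-1" suffix marking a 1×1 conv are my own notation, not from the paper or the library):

```python
# Numbers are output channels of a 3x3 conv, 'M' is a 2x2 max pool,
# and entries like '256-1' mark the 1x1 conv layers of configuration C.
vgg_cfgs = {
    "VGG-11 (A)": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "VGG-13 (B)": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "VGG-16 (C)": [64, 64, "M", 128, 128, "M", 256, 256, "256-1", "M",
                   512, 512, "512-1", "M", 512, 512, "512-1", "M"],   # three 1x1 convs
    "VGG-16 (D)": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                   512, 512, 512, "M", 512, 512, 512, "M"],           # 1x1 replaced by 3x3
    "VGG-19 (E)": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
                   512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}
```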

2. Technical details
2.1 Why use small filters: 3x3 kernel
A stack of small filters covers the same receptive field as a single larger filter:
2 layers of 3x3 kernels cover a 5x5 kernel
3 layers of 3x3 kernels cover a 7x7 kernel
5 layers of 3x3 kernels cover an 11x11 kernel
The number of parameters is smaller (counting a single channel):
1 layer of an 11x11 kernel vs. 5 layers of 3x3 kernels
1 layer of an 11x11 kernel has 11x11 = 121 parameters
5 layers of 3x3 kernels have 3x3x5 = 45 parameters
1 layer of a 7x7 kernel vs. 3 layers of 3x3 kernels
1 layer of a 7x7 kernel has 7x7 = 49 parameters
3 layers of 3x3 kernels have 3x3x3 = 27 parameters
==> Faster convergence and less overfitting (with more ReLU non-linearities per receptive field); see the quick check below.
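A quick numerical check of these counts, first per channel as in the text and then with channels included (the choice of C = 512 is just illustrative; biases are ignored):

```python
def stacked_params(kernel, layers, channels=1):
    # Each layer has kernel*kernel weights per (input channel, output channel) pair.
    return layers * kernel * kernel * channels * channels

for big, n_small in [(5, 2), (7, 3), (11, 5)]:
    single = stacked_params(big, 1)
    stack = stacked_params(3, n_small)
    print(f"{big}x{big}: {single} params  vs  {n_small} stacked 3x3: {stack} params")
# 5x5: 25 vs 18, 7x7: 49 vs 27, 11x11: 121 vs 45

# With C channels per layer the ratio is unchanged, e.g. for 7x7 and C = 512:
C = 512
print(stacked_params(7, 1, C), "vs", stacked_params(3, 3, C))   # 49*C*C vs 27*C*C
```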
2.2 Why use 1x1 convolution layer
Problem of too many feature maps
Because the convolution operation must be performed through the full depth of the input, a large number of feature maps is expensive, especially with relatively large kernels (5x5, 7x7)
Downsample the channel dimension with 1x1 filters
A 1×1 filter has only a single weight for each channel of the input
It can be used for channel-wise pooling.
This simple technique can be used for dimensionality reduction, decreasing the number of feature maps whilst retaining their salient features.
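A minimal sketch of this channel-wise reduction in PyTorch (the 512 → 64 projection is an arbitrary example, not a specific VGGNet layer):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 28, 28)            # 512 feature maps of size 28x28
reduce = nn.Conv2d(512, 64, kernel_size=1) # 1x1 conv: one weight per input channel per filter
y = reduce(x)
print(y.shape)                              # torch.Size([1, 64, 28, 28])
# Spatial size is unchanged; only the channel dimension is projected down,
# so a following 3x3/5x5 conv operates over far fewer input channels.
```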
2.3 Multi-scale training
Motivation
Since objects appear at different scales within an image, training the network at a single scale may miss objects at other scales.
For single-scale training, an image is rescaled so that its shorter side equals 256 or 384; the rescaled image is then cropped to 224x224.
For multi-scale training, an image is rescaled so that its shorter side is drawn from the range [256, 512], then cropped to 224x224 (see the sketch below).
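A sketch of such a training-time preprocessing pipeline using torchvision transforms (the scale range follows the text; the remaining augmentation details are assumptions):

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

class MultiScaleTrainTransform:
    """Rescale the shorter side to a random S in [s_min, s_max], then take a random 224x224 crop."""
    def __init__(self, s_min=256, s_max=512, crop=224):
        self.s_min, self.s_max, self.crop = s_min, s_max, crop

    def __call__(self, img):
        s = random.randint(self.s_min, self.s_max)  # a new S is drawn for every image
        img = TF.resize(img, s)                     # isotropic rescale, shorter side = S
        img = T.RandomCrop(self.crop)(img)          # random 224x224 crop
        img = T.RandomHorizontalFlip()(img)         # flip augmentation
        return TF.to_tensor(img)

# Single-scale training corresponds to s_min == s_max (e.g. 256 or 384).
```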
2.4 Multi-scale testing
Motivation
Multi-scale testing can also reduce the error rate, since we do not know the size of the object in the test image.
By scaling the test image to several sizes, we increase the chance of correct classification (a sketch follows).
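One simple way to realize this is to average the class probabilities over a few test scales. A sketch, assuming a `model` that maps a 224x224 crop to class logits (the scale set and the crop-based evaluation are illustrative, not the paper's exact dense-evaluation protocol):

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def multiscale_predict(model, img, scales=(256, 384, 512), crop=224):
    """Average softmax outputs over several rescaled versions of the test image."""
    model.eval()
    probs = []
    for s in scales:
        x = TF.center_crop(TF.resize(img, s), crop)  # rescale, then take the center 224x224 crop
        x = TF.to_tensor(x).unsqueeze(0)             # add a batch dimension
        with torch.no_grad():
            probs.append(F.softmax(model(x), dim=1))
    return torch.stack(probs).mean(dim=0)            # averaged class probabilities
```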
2.5 Convolutional testing: FC layer -> Conv layer in testing


In VGGNet, the testing structure is different from the training structure:
The first FC layer is replaced by a 7x7 conv layer
The second and third FC layers are replaced by 1x1 conv layers
Why use Conv layer to replace FC layer:
The only difference between FC and CONV layers is that the neurons in a CONV layer are connected only to a local region of the input
For any CONV layer, there is an FC layer that implements the same forward function
Any FC layer can be converted to a CONV layer:
Set the CONV filter size to the size of the input volume
Set the number of CONV filters to the number of neurons in the FC layer
For example, for an FC layer with K = 4096 neurons and an input volume of size 7x7x512,
we can replace it with a CONV layer of 4096 filters of size 7x7
This conversion lets the network process images of different sizes.
Forwarding the converted (fully convolutional) network a single time over a larger image is much more efficient than evaluating the original network repeatedly at different crop locations; the sketch below shows the weight conversion.
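A minimal sketch of converting the first FC layer (25088 → 4096) into an equivalent 7x7 convolution by reshaping its weight matrix (PyTorch; the layer variables are illustrative):

```python
import torch
import torch.nn as nn

fc = nn.Linear(7 * 7 * 512, 4096)             # trained FC layer
conv = nn.Conv2d(512, 4096, kernel_size=7)    # 4096 filters of size 7x7x512
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)  # reshape FC weights into filters
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)                 # pool5 output for a 224x224 input
out_fc = fc(x.flatten(1))                     # shape (1, 4096)
out_conv = conv(x).flatten(1)                 # shape (1, 4096), same values
print(torch.allclose(out_fc, out_conv, atol=1e-5))  # True

# On a larger input (e.g. a 14x14x512 pool5 map from a 448x448 image),
# the conv version produces a spatial map of scores in one forward pass.
```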