1. Architecture
GoogLeNet can go very deep. It relies on the following techniques to make this possible:
2. 1x1 Convolution
The 1x1 convolution layer is used as a dimension-reduction module to cut down on computation. With this computational bottleneck reduced, both the depth and the width of the network can be increased.
Without the 1x1 convolution (a 5x5 convolution applied directly to the 14x14x480 input to produce 48 feature maps):
Number of operations = (14x14x48) x (5x5x480) = 112.9 M
With a 1x1 convolution inserted before the 5x5 convolution:
Number of operations for the 1x1 kernel = (14x14x16) x (1x1x480) = 1.5 M
Number of operations for the 5x5 kernel = (14x14x48) x (5x5x16) = 3.8 M
Total = 1.5 M + 3.8 M = 5.3 M, far fewer than the 112.9 M needed without the reduction.
==> In effect, the 1x1 convolution maps the features from a high-dimensional space to a lower-dimensional one, and does so non-linearly, since a ReLU follows each convolution.
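As a quick sanity check of the arithmetic above, here is a small plain-Python sketch that counts one multiply-accumulate per kernel weight per output element (the function name is just for illustration):

```python
# Operations for one conv layer:
# (output H x output W x output channels) x (kernel H x kernel W x input channels)
def conv_ops(out_h, out_w, out_c, k_h, k_w, in_c):
    return out_h * out_w * out_c * k_h * k_w * in_c

# Direct 5x5 convolution: 480 -> 48 channels on a 14x14 map
direct = conv_ops(14, 14, 48, 5, 5, 480)

# With a 1x1 bottleneck: 480 -> 16 channels, then 5x5: 16 -> 48 channels
reduce_1x1 = conv_ops(14, 14, 16, 1, 1, 480)
conv_5x5 = conv_ops(14, 14, 48, 5, 5, 16)

print(f"direct 5x5:   {direct / 1e6:.1f} M")                    # 112.9 M
print(f"1x1 then 5x5: {(reduce_1x1 + conv_5x5) / 1e6:.1f} M")   # 5.3 M
```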
3. Inception Module
The inception module is used to extract different kinds of features from the same layer.
Inception module (no 1x1 convolution)
The 1×1 conv, 3×3 conv, 5×5 conv, and 3×3 max pooling are all applied in parallel to the output of the previous layer, and their outputs are stacked together again at the output.
==> As the input comes in, convolutions of different kernel sizes, as well as max pooling, are tried in parallel on the same feature maps.
To reduce the number of operations in the inception module, 1x1 convolution layers are inserted before the 3x3 and 5x5 convolutions (and after the max pooling):
Inception module (with 1x1 convolution)
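A minimal PyTorch sketch of such a module is given below (the class name is illustrative; the channel counts in the usage example correspond to the 14x14x480 input used earlier, and each convolution is followed by a ReLU):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception module with 1x1 dimension-reduction layers (sketch)."""
    def __init__(self, in_c, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        # Branch 1: 1x1 convolution
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_c, c1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction, then 3x3 convolution
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_c, c3_reduce, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction, then 5x5 convolution
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_c, c5_reduce, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling, then 1x1 projection
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_c, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Every branch sees the same input; outputs are concatenated along channels
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

# Example: a 14x14x480 input with a 16-channel reduction before the 5x5 branch
block = InceptionModule(480, c1=192, c3_reduce=96, c3=208,
                        c5_reduce=16, c5=48, pool_proj=64)
out = block(torch.randn(1, 480, 14, 14))
print(out.shape)  # torch.Size([1, 512, 14, 14])
```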
4. Global Average Pooling (GAP)
Fully Connected Layer vs. Global Average Pooling
For a fully connected (FC) layer, the number of weights = 7x7x1024x1024 = 51.3 M
In GoogLeNet, global average pooling is used instead: each feature map is averaged from 7x7 down to 1x1. The number of weights = 0
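The difference is easy to check in PyTorch (a short illustrative comparison, assuming the 7x7x1024 feature maps and a 1024-unit FC layer from the example above):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1024, 7, 7)    # final 7x7x1024 feature maps

# FC head: flatten the 7x7x1024 maps, then a 1024-unit fully connected layer
fc = nn.Linear(7 * 7 * 1024, 1024)
print(fc.weight.numel())          # 51,380,224 weights (~51.3 M)

# GAP head: average each 7x7 map down to a single value -> no weights at all
gap = nn.AdaptiveAvgPool2d(1)
print(gap(x).shape)               # torch.Size([1, 1024, 1, 1])
print(sum(p.numel() for p in gap.parameters()))  # 0
```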
5. Auxiliary Classifiers for Training
==> These combat the vanishing-gradient problem and provide extra regularization.
The softmax branches in the middle of the network are used during training only. Each classifier consists of:
5x5 average pooling (stride 3)
1x1 convolution with 128 filters
a fully connected layer with 1024 units
dropout (70%)
a 1000-way fully connected layer with softmax
==> The loss is added to the total loss with weight 0.3
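A minimal PyTorch sketch of one auxiliary head built from the layers listed above (the class name and the 512-channel input in the usage line are illustrative):

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary softmax branch, used only during training (sketch)."""
    def __init__(self, in_c, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(5, stride=3)     # 5x5 average pooling, stride 3
        self.conv = nn.Conv2d(in_c, 128, 1)       # 1x1 convolution, 128 filters
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)   # FC 1024 (14x14 maps pool down to 4x4)
        self.drop = nn.Dropout(0.7)               # 70% dropout
        self.fc2 = nn.Linear(1024, num_classes)   # 1000-way classifier (softmax applied in the loss)

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.drop(torch.relu(self.fc1(x)))
        return self.fc2(x)

# Usage during training: the auxiliary losses are added with weight 0.3
aux = AuxClassifier(512)
logits = aux(torch.randn(1, 512, 14, 14))
# total_loss = main_loss + 0.3 * aux1_loss + 0.3 * aux2_loss
print(logits.shape)  # torch.Size([1, 1000])
```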