AlexNet

1. Architecture

AlexNet has 8 layers: 5 convolutional layers and 3 Fully Connected (FC) layers; a code sketch of the full stack follows the list below.

  • Convolutional Layer 1: 96 kernels with size 11×11×3 (stride = 4, pad = 0) => 55×55×96 feature maps

    • Overlapping Max pooling with size 3×3 (stride = 2) => 27×27×96 feature maps

    • Local Response Normalization => 27×27×96 feature maps

  • Convolutional Layer 2: 256 kernels with size 5×5×96 (stride = 1, pad = 2) => 27×27×256 feature maps

    • Overlapping Max pooling with size 3×3 (stride = 2) => 13×13×256 feature maps

    • Local Response Normalization => 13×13×256 feature maps

  • Convolutional Layer 3: 384 kernels with size 3×3×256 (stride = 1, pad = 1) => 13×13×384 feature maps

  • Convolutional Layer 4: 384 kernels with size 3×3×384 (stride = 1, pad = 1) => 13×13×384 feature maps

  • Convolutional Layer 5: 256 kernels with size 3×3×384 (stride = 1, pad = 1) => 13×13×256 feature maps

    • Overlapping Max pooling with size 3×3 (stride = 2) => 6×6×256 feature maps

  • Fully Connected Layer 1: 4096 neurons

  • Fully Connected Layer 2: 4096 neurons

  • Fully Connected Layer 3: 1000 neurons:

    • Output 1000 neurons (1000 classes)

    • Softmax is used for calculating the loss
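
A minimal single-stream sketch of this stack in PyTorch; the ReLU activations, the dropout probability of 0.5 in the FC layers, the LRN hyperparameters, and the 227×227×3 input size are assumptions that the list above does not state.

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    """Single-stream sketch of the 8-layer stack listed above."""

    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),               # Conv1 -> 55×55×96
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # overlapping pool -> 27×27×96
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),   # Conv2 -> 27×27×256
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # overlapping pool -> 13×13×256
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),  # Conv3 -> 13×13×384
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),  # Conv4 -> 13×13×384
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),  # Conv5 -> 13×13×256
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # overlapping pool -> 6×6×256
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),    # FC1
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),           # FC2
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),    # FC3; softmax is applied inside the loss
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNet()
print(model(torch.randn(1, 3, 227, 227)).shape)   # torch.Size([1, 1000])
```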

Overlapping Pooling

Overlapping pooling is pooling whose stride is smaller than the kernel size, while non-overlapping pooling uses a stride equal to or larger than the kernel size.

It is used to mitigate overfitting: because adjacent pooling windows share pixels, the pooled outputs change more smoothly.
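
A small sketch of the difference, assuming PyTorch's MaxPool2d: a 3×3 window with stride 2 produces overlapping windows, while a 2×2 window with stride 2 does not.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                            # e.g. the Conv1 output above

overlapping = nn.MaxPool2d(kernel_size=3, stride=2)       # stride < kernel size: windows overlap
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)   # stride == kernel size: no overlap

# The output sizes happen to coincide for this input, but the overlapping windows
# share a row/column of pixels with their neighbours.
print(overlapping(x).shape)       # torch.Size([1, 96, 27, 27])
print(non_overlapping(x).shape)   # torch.Size([1, 96, 27, 27])
```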

Dropout

At each training stage: individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.

Prevent overfitting: the fully connected layers hold most of the parameters, so neurons develop co-dependencies amongst each other during training; this curbs the individual power of each neuron and leads to overfitting of the training data.

How to dropout:

  • Training phase: for each hidden layer, for each training sample, for each iteration, ignore (zero out) a random fraction, 1-p, of nodes (and their corresponding activations).

  • Testing phase: use all activations, but scale them by a factor of p (to account for the activations that were dropped during training).
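
A minimal NumPy sketch of the two phases, with p as the keep probability defined above (the function name and shapes are illustrative):

```python
import numpy as np

def dropout_forward(x, p, train, rng):
    """Apply dropout to activations x; p is the keep probability."""
    if train:
        mask = rng.random(x.shape) < p   # keep each unit with probability p
        return x * mask                  # dropped units contribute zero
    return x * p                         # testing: use all units, scaled by p

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))              # toy hidden-layer activations
print(dropout_forward(h, p=0.5, train=True, rng=rng))    # random units zeroed
print(dropout_forward(h, p=0.5, train=False, rng=rng))   # all units, scaled by 0.5
```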

Impact of dropout:

  • Forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

  • Roughly doubles the number of iterations required to converge; however, the training time for each epoch is shorter.

  • With H hidden units, each of which can be dropped, there are 2^H possible thinned models. In the testing phase, the entire network is used and each activation is scaled by a factor of p.
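
The test-time scaling can be checked numerically: averaging the output over many randomly thinned networks is close to using the full network with every activation scaled by p (a toy NumPy check; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # keep probability
x = rng.normal(size=(100,))                # toy activations of H = 100 hidden units
w = rng.normal(size=(100,))                # weights into one output unit

# Training-time view: average the output over many random thinned networks.
masks = rng.random((20000, 100)) < p
sampled_mean = ((x * masks) @ w).mean()

# Test-time view: keep every unit but scale its activation by p.
scaled_output = (x * p) @ w

print(sampled_mean, scaled_output)         # the two values should be close
```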
