AlexNet

1. Architecture

AlexNet has 8 layers: 5 convolutional layers and 3 Fully Connected (FC) layers; a code sketch of the full stack follows the list below.

  • Convolutional Layer 1: 96 kernels with size 11×11×3 (stride = 4, pad = 0) => 55×55×96 feature maps

    • Overlapping Max pooling with size 3×3 (stride = 2) => 27×27×96 feature maps

    • Local Response Normalization => 27×27×96 feature maps

  • Convolutional Layer 2: 256 kernels with size 5×5×96 (stride = 1, pad = 2) => 27×27×256 feature maps

    • Overlapping Max pooling with size 3×3 (stride = 2) => 13×13×256 feature maps

    • Local Response Normalization => 13×13×256 feature maps

  • Convolutional Layer 3: 384 kernels with size 3×3×256 (stride = 1, pad = 1) => 13×13×384 feature maps

  • Convolutional Layer 4: 384 kernels with size 3×3×384 (stride = 1, pad = 1) => 13×13×384 feature maps

  • Convolutional Layer 5: 256 kernels with size 3×3×384 (stride = 1, pad = 1) => 13×13×256 feature maps

    • Overlapping Max pooling with size 3×3 (stride = 2) => 6×6×256 feature maps

  • Fully Connected Layer 1: 4096 neurons

  • Fully Connected Layer 2: 4096 neurons

  • Fully Connected Layer 3: 1000 neurons:

    • Output 1000 neurons (1000 classes)

    • Softmax is used for calculating the loss
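
A minimal single-stream sketch of this stack in PyTorch; the ReLU activations, the dropout probability of 0.5 in the FC layers, the LRN hyperparameters, and the 227×227×3 input size are assumptions that the list above does not state.

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    """Single-stream sketch of the 8-layer stack listed above."""

    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),               # Conv1 -> 55×55×96
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # overlapping pool -> 27×27×96
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),   # Conv2 -> 27×27×256
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # overlapping pool -> 13×13×256
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),  # Conv3 -> 13×13×384
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),  # Conv4 -> 13×13×384
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),  # Conv5 -> 13×13×256
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # overlapping pool -> 6×6×256
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),    # FC1
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),           # FC2
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),    # FC3; softmax is applied inside the loss
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNet()
print(model(torch.randn(1, 3, 227, 227)).shape)   # torch.Size([1, 1000])
```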

Overlapping Pooling

Overlapping pooling is pooling whose stride is smaller than the kernel size, while non-overlapping pooling uses a stride equal to or larger than the kernel size.

It is used to mitigate overfitting: because adjacent pooling windows share pixels, the pooled outputs change more smoothly.
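
A small sketch of the difference, assuming PyTorch's MaxPool2d: a 3×3 window with stride 2 produces overlapping windows, while a 2×2 window with stride 2 does not.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                            # e.g. the Conv1 output above

overlapping = nn.MaxPool2d(kernel_size=3, stride=2)       # stride < kernel size: windows overlap
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)   # stride == kernel size: no overlap

# The output sizes happen to coincide for this input, but the overlapping windows
# share a row/column of pixels with their neighbours.
print(overlapping(x).shape)       # torch.Size([1, 96, 27, 27])
print(non_overlapping(x).shape)   # torch.Size([1, 96, 27, 27])
```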

Dropout

At each training stage: individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.

Prevent overfitting: the fully connected layers hold most of the parameters, so neurons develop co-dependencies amongst each other during training; this curbs the individual power of each neuron and leads to overfitting of the training data.

How to dropout:

  • Training phase: for each hidden layer, for each training sample, for each iteration, ignore (zero out) a random fraction, 1-p, of nodes (and their corresponding activations).

  • Testing phase: use all activations, but scale them by a factor of p (to account for the activations that were dropped during training).
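
A minimal NumPy sketch of the two phases, with p as the keep probability defined above (the function name and shapes are illustrative):

```python
import numpy as np

def dropout_forward(x, p, train, rng):
    """Apply dropout to activations x; p is the keep probability."""
    if train:
        mask = rng.random(x.shape) < p   # keep each unit with probability p
        return x * mask                  # dropped units contribute zero
    return x * p                         # testing: use all units, scaled by p

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))              # toy hidden-layer activations
print(dropout_forward(h, p=0.5, train=True, rng=rng))    # random units zeroed
print(dropout_forward(h, p=0.5, train=False, rng=rng))   # all units, scaled by 0.5
```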

Impact of dropout:

  • Forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

  • Roughly doubles the number of iterations required to converge; however, the training time for each epoch is shorter.

  • With H hidden units, each of which can be dropped, there are 2^H possible thinned models. In the testing phase, the entire network is used and each activation is scaled by a factor of p.
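
The test-time scaling can be checked numerically: averaging the output over many randomly thinned networks is close to using the full network with every activation scaled by p (a toy NumPy check; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # keep probability
x = rng.normal(size=(100,))                # toy activations of H = 100 hidden units
w = rng.normal(size=(100,))                # weights into one output unit

# Training-time view: average the output over many random thinned networks.
masks = rng.random((20000, 100)) < p
sampled_mean = ((x * masks) @ w).mean()

# Test-time view: keep every unit but scale its activation by p.
scaled_output = (x * p) @ w

print(sampled_mean, scaled_output)         # the two values should be close
```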
