AlexNet
1. Architecture

AlexNet has 8 layers: 5 convolutional layers and 3 Fully Connected (FC) layers.
Convolutional Layer 1: 96 kernels of size 11×11 (stride = 4, pad = 0) => 55×55×96 feature maps
Overlapping max pooling with size 3×3 (stride = 2) => 27×27×96 feature maps
Local Response Normalization => 27×27×96 feature maps
Convolutional Layer 2: 256 kernels of size 5×5 (stride = 1, pad = 2) => 27×27×256 feature maps
Overlapping max pooling with size 3×3 (stride = 2) => 13×13×256 feature maps
Local Response Normalization => 13×13×256 feature maps
Convolutional Layer 3: 384 kernels of size 3×3 (stride = 1, pad = 1) => 13×13×384 feature maps
Convolutional Layer 4: 384 kernels of size 3×3 (stride = 1, pad = 1) => 13×13×384 feature maps
Convolutional Layer 5: 256 kernels of size 3×3 (stride = 1, pad = 1) => 13×13×256 feature maps
Overlapping max pooling with size 3×3 (stride = 2) => 6×6×256 feature maps
Fully Connected Layer 1: 4096 neurons
Fully Connected Layer 2: 4096 neurons
Fully Connected Layer 3: 1000 neurons
Output: 1000 neurons (1000 classes)
Softmax over the 1000 outputs is used for calculating the loss (a sketch of the full stack follows below)
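The layer stack above can be written down, for instance, as a PyTorch module. This is only a minimal single-branch sketch: the original paper splits several convolutional layers across two GPUs, and details such as the 227×227 input size and the LRN/dropout placement follow common re-implementations rather than being spelled out in the list above.

```python
import torch
import torch.nn as nn

# Minimal single-branch sketch of the AlexNet layer stack described above
# (the original paper splits conv2/4/5 across two GPUs; this merges them).
class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),               # conv1: 227x227x3 -> 55x55x96
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # overlapping pool -> 27x27x96
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),   # conv2 -> 27x27x256
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # overlapping pool -> 13x13x256
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),  # conv3 -> 13x13x384
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),  # conv4 -> 13x13x384
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),  # conv5 -> 13x13x256
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # overlapping pool -> 6x6x256
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),                             # FC1: 4096 neurons
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                    # FC2: 4096 neurons
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                             # FC3: 1000 class scores
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# Softmax is folded into the loss:
# loss = nn.CrossEntropyLoss()(AlexNet()(torch.randn(1, 3, 227, 227)), torch.tensor([0]))
```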
Overlapping Pooling
Overlapping pooling is pooling with a stride smaller than the kernel size, while non-overlapping pooling is pooling with a stride equal to or larger than the kernel size.
It is used to mitigate overfitting: because adjacent pooling windows overlap, the pooled responses change more smoothly, and the authors observed that models with overlapping pooling are slightly harder to overfit.
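A quick way to see the difference, assuming PyTorch (the tensor shape here is just the conv1 output size from above): the 3×3/stride-2 windows overlap by one pixel, while the 2×2/stride-2 windows tile the map without overlap.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                            # e.g. the conv1 output above

overlapping = nn.MaxPool2d(kernel_size=3, stride=2)       # stride < kernel size
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)   # stride == kernel size

# Both happen to give 27x27 maps here; the difference is that each
# overlapping window shares a row/column of pixels with its neighbour.
print(overlapping(x).shape)      # torch.Size([1, 96, 27, 27])
print(non_overlapping(x).shape)  # torch.Size([1, 96, 27, 27])
```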
Dropout
At each training stage: individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.
Prevent overfitting: the fully connected layers hold most of the parameters, so neurons develop co-dependencies on each other during training; this curbs the individual power of each neuron and leads to overfitting of the training data.
How to apply dropout:
Training phase: For each hidden layer, for each training sample, for each iteration, ignore (zero out) a random fraction 1 − p of the nodes (and the corresponding activations).
Testing phase: Use all activations, but scale them by a factor of p to account for the activations that were dropped during training (a sketch follows below).
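A minimal NumPy sketch of this train/test rule. The array size, seed, and keep probability are only illustrative, and note that modern frameworks typically implement "inverted dropout", which rescales during training instead of at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # keep probability (AlexNet uses 0.5 on the first two FC layers)

def dropout_train(activations, p, rng):
    # Keep each unit with probability p and zero it otherwise;
    # a fresh mask is drawn for every sample and every iteration.
    mask = rng.random(activations.shape) < p
    return activations * mask

def dropout_test(activations, p):
    # Use every unit, but scale by p so the expected activation
    # matches what the next layer saw during training.
    return activations * p

a = rng.standard_normal(4096)        # e.g. an FC-layer activation vector
print(dropout_train(a, p, rng)[:5])  # roughly half of the entries are zeroed
print(dropout_test(a, p)[:5])        # all entries kept, scaled by 0.5
```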
Impact of dropout:
Forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
Roughly doubles the number of iterations required to converge; however, the training time for each epoch is less.
With H hidden units, each of which can be dropped, there are 2^H possible thinned models. In the testing phase, the entire network is used and each activation is scaled by a factor of p, which approximates averaging the predictions of these models (see the check below).
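A small numerical check of why scaling by p works, for a single linear output unit (H, the weights, the seed, and the sample count here are arbitrary choices): averaging the output over many random thinned sub-networks comes out close to one pass through the full network with activations scaled by p.

```python
import numpy as np

rng = np.random.default_rng(1)
H, p = 8, 0.5                 # 2**H = 256 possible thinned sub-networks

a = rng.standard_normal(H)    # hidden activations feeding one output unit
w = rng.standard_normal(H)    # weights of that output unit

# Monte Carlo average over random thinned sub-networks (random keep-masks)...
samples = [(a * (rng.random(H) < p)) @ w for _ in range(100_000)]
print(np.mean(samples))       # close to the value below

# ...which the test-time rule reproduces with one pass through the full network.
print((a * p) @ w)
```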