Regularizations
Overall, we want a simpler model in order to reduce overfitting.
L2 regularization: $\sum_k \sum_l W_{k,l}^2$
It works in the following way: with L2 regularization, W = [0.25, 0.25, 0.25, 0.25] is preferred over W = [1, 0, 0, 0], so the decision ends up relying on all 4 input features rather than counting on only one of them.
L2 regularization also corresponds to MAP inference with a Gaussian prior on W.
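A minimal numpy sketch of the toy comparison above (the helper name `l2_penalty` is just for illustration), showing that the spread-out weights get the smaller L2 penalty:

```python
import numpy as np

def l2_penalty(W):
    """L2 regularization term: sum of squared weights."""
    return np.sum(W ** 2)

w_spread = np.array([0.25, 0.25, 0.25, 0.25])
w_sparse = np.array([1.0, 0.0, 0.0, 0.0])

print(l2_penalty(w_spread))  # 0.25 -> smaller penalty, preferred by L2
print(l2_penalty(w_sparse))  # 1.0
```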
L1 regularization: $\sum_k \sum_l |W_{k,l}|$. L1 forces the model to be more sparse.
Conversely, L1 regularization has roughly the opposite interpretation to L2: it encourages sparsity, so W = [1, 0, 0, 0] is preferred over W = [1, 1, 1, 1].
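The same kind of sketch for L1, using the two weight vectors from the sentence above:

```python
import numpy as np

def l1_penalty(W):
    """L1 regularization term: sum of absolute weights."""
    return np.sum(np.abs(W))

w_sparse = np.array([1.0, 0.0, 0.0, 0.0])
w_dense  = np.array([1.0, 1.0, 1.0, 1.0])

print(l1_penalty(w_sparse))  # 1.0 -> smaller penalty, preferred by L1
print(l1_penalty(w_dense))   # 4.0
```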
Elastic net (L1 + L2): $\sum_k \sum_l (\beta W_{k,l}^2 + |W_{k,l}|)$
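A sketch of the combined penalty; the default value of `beta` here is an arbitrary assumption for illustration:

```python
import numpy as np

def elastic_net_penalty(W, beta=0.5):
    """Elastic net: beta-weighted L2 term plus L1 term."""
    return np.sum(beta * W ** 2 + np.abs(W))

W = np.array([1.0, 0.0, 0.0, 0.0])
print(elastic_net_penalty(W))  # 0.5 * 1.0 + 1.0 = 1.5
```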
Dropout: set random activations to zero (for fully connected layers), or set random channels to zero (for convolutional layers).
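A minimal numpy sketch of the fully connected case, written in the common "inverted dropout" form; the 1/(1-p) rescaling at train time and the p = 0.5 default are assumptions added here, not stated above:

```python
import numpy as np

def dropout(x, p=0.5, train=True):
    """Inverted dropout: zero each activation with probability p at train time,
    scaling the survivors by 1/(1-p) so no change is needed at test time."""
    if not train:
        return x
    mask = (np.random.rand(*x.shape) > p) / (1.0 - p)
    return x * mask
```

For convolutional layers, the same idea is applied per channel: the mask zeros entire channels instead of individual activations.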