Activation Functions

Sigmoid

  1. Has a roughly linear region around 0 input, but squashes activations to [0, 1]

  2. Saturated neurons "kill" the gradients

  3. The sigmoid output is not zero-centered: all activations are positive. For the next layer y = W*x + b, the local gradient of W is x, so every entry of dL/dW gets the same sign (the sign of the upstream gradient dL/dy). All the weights either increase together or decrease together.

  4. Why is not being zero-centered bad? Since dL/dW has the same sign for all parameters, reaching a hypothetical optimal w requires many zig-zag updates: the gradients (the update directions, the red arrow directions) are only allowed in the top-right and bottom-left quadrants. These are very inefficient gradient updates (see the sketch after this list).

  5. This is also why, in general, we want zero-mean data: with both positive and negative input values we avoid this problem of all the weights moving in the same direction.

  6. exp() is a bit computationally expensive.
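
To make items 2-4 concrete, here is a minimal NumPy sketch (all values and shapes are made up for illustration): the local gradient vanishes far from 0, and because sigmoid outputs are all positive, every row of dL/dW shares the sign of the corresponding upstream gradient entry.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # local gradient of the sigmoid

# 2. Saturation: far from 0 the local gradient is ~0, so almost no gradient
#    flows back through the neuron.
print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))   # [~4.5e-05, 0.25, ~4.5e-05]

# 3./4. Non-zero-centered outputs: sigmoid activations are all positive.
#    For the next layer y = W @ x + b we have dL/dW[i, :] = dL/dy[i] * x,
#    so every row of dL/dW shares the sign of the corresponding dL/dy entry.
x = sigmoid(np.random.randn(4))      # all entries in (0, 1), i.e. all positive
dL_dy = np.array([+1.0, -2.0])       # made-up upstream gradient
dL_dW = np.outer(dL_dy, x)           # shape (2, 4)
print(np.sign(dL_dW))                # each row is all +1 or all -1 -> zig-zag updates
```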

tanh (pronounced as tan-h, h like h i j k l m n)

  1. Zero centered (pros)

  2. Still kills gradients when saturated (cons)
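
A quick NumPy check (illustrative only) of both points: tanh outputs are zero-centered, unlike sigmoid, but the tanh gradient 1 - tanh(x)^2 still vanishes when |x| is large.

```python
import numpy as np

x = np.random.randn(10_000)
print(np.tanh(x).mean())                    # ~0: tanh outputs are zero-centered
print((1.0 / (1.0 + np.exp(-x))).mean())    # ~0.5: sigmoid outputs are all positive

# But tanh still saturates: its gradient 1 - tanh(x)^2 vanishes for large |x|.
print(1.0 - np.tanh(np.array([-10.0, 0.0, 10.0])) ** 2)   # [~0, 1, ~0]
```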

ReLU (pronounced as re-lu, re like repeat)

  1. f(x) = max(0, x), computationally efficient (pros)

  2. Does not saturate in + region (pros)

  3. Converges much faster than sigmoid/tanh in practice (e.g. 6x)

  4. AlexNet (2012) used ReLU in its experiments, and it was the first major CNN to do well on ImageNet and large-scale data. Since then, ReLU has been used a lot. (pros)

  5. Not zero-centered output (cons)

  6. In the negative half the gradient is zero, and at the origin the gradient is not defined. (cons)

  7. In an abstract way, each ReLU neuron separates the input plane in half and is activated on only one of the halves. So some ReLUs may never receive any gradient update (zero gradient) and be dead forever, like the red one in the figure. Two reasons for this (see the sketch after this list):

    1. Bad weight initialization. If a neuron's weights happen to be unlucky and place it off the data cloud (like the red one), then no data input will ever activate it, so no gradient ever flows back, and it is dead forever.

    2. The learning rate is too high: a large update can knock the weights off the data cloud, after which the neuron never activates again.
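
A minimal NumPy sketch of the dead-ReLU case, with a made-up 2D data cloud and a deliberately bad bias standing in for unlucky initialization; all names and numbers here are illustrative only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)   # zero gradient over the whole negative half

X = np.random.randn(1000, 2)       # a toy data cloud centered at the origin

w_healthy, b_healthy = np.array([1.0, 1.0]), 0.0
w_dead,    b_dead    = np.array([1.0, 1.0]), -100.0   # half-plane pushed far off the data

for name, w, b in [("healthy", w_healthy, b_healthy), ("dead", w_dead, b_dead)]:
    z = X @ w + b
    frac_active = (z > 0).mean()                          # fraction of inputs that activate it
    grad_signal = np.abs(relu_grad(z) * X[:, 0]).mean()   # rough size of the gradient to w[0]
    print(name, frac_active, grad_signal)
# "dead" prints 0.0 for both: no input ever activates it, so no gradient ever flows back.
```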

ReLU6 (activation = min(max(features, 0), 6), why 6 here?)

From this reddit thread:

This is useful in making the networks ready for fixed-point inference. If you unbound the upper limit, you lose too many bits to the Q part of a Q.f number. Keeping the ReLUs bounded by 6 will let them take a max of 3 bits (upto 8) leaving 4/5 bits for .f

It seems, then, that 6 is just an arbitrary value chosen according to the number of bits you want to be able to compress your network's trained parameters into. As for why only the version with value 6 is implemented, I assume it's because that's the value that fits best in 8 bits, which is probably the most common use case.
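
A small sketch of that fixed-point argument, assuming an unsigned 8-bit code with 3 integer bits and 5 fractional bits (a Q3.5-style format); this is illustrative only, not how any particular framework actually quantizes.

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

FRAC_BITS = 5              # 8 bits total: 3 integer bits cover 0..7 >= 6
SCALE = 2 ** FRAC_BITS     # 32 quantization steps per unit

def to_q3_5(a):
    return np.round(a * SCALE).astype(np.uint8)   # fits: relu6(a) <= 6, so code <= 192

def from_q3_5(q):
    return q.astype(np.float32) / SCALE

x = np.array([-3.0, 0.4, 2.71828, 5.5, 42.0])
a = relu6(x)               # [0.0, 0.4, 2.71828, 5.5, 6.0]
q = to_q3_5(a)             # 8-bit codes, max value 6 * 32 = 192 < 255
print(a, q, from_q3_5(q))  # round-trip error is at most 1/64 ~= 0.016
```

Without the bound at 6, the integer part of an activation could be arbitrarily large, forcing more integer bits and leaving fewer bits for the fractional part, which is the trade-off the quote describes.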
