Why do we need activation functions?
We use activation functions to introduce non-linearity into the neural network. Without them, no matter how many layers we stack, the network can only represent a linear mapping of its inputs. Now the question is, why do we need non-linearities?
Let’s take a linear equation, y = x. If we plot it, the graph is a straight line.
Linear functions always form a straight line. Even if you add more dimensions, the result is still flat.
Now let’s take a nonlinear equation, y = x². If we plot it, as you can see, the graph has curves.
Neural networks are “Universal Approximators”: they can approximate practically any function.
However, if we don’t have a non-linear activation function, it always produces a linear output.
Remember a neural network without an activation function is just a linear regression model.
Another important property of an activation function is that it should be differentiable.
Why? Because we use backpropagation to calculate the gradient of the error function with respect to the weights, and computing those gradients requires differentiating the activation function.
Now let’s see some of the commonly used activation functions.
The simplest activation function is the step function, which is used by the perceptron model. It is a threshold-based activation function.
f(x) = 1 if x >= 0
f(x) = 0 otherwise
As we can see from the equation above, this is a straightforward threshold function.
If the input value is greater than or equal to the threshold, it returns 1; otherwise it returns 0.
As we have seen earlier, the activation function should be differentiable. The step function is not differentiable at 0, and its derivative is 0 everywhere else, so gradient descent gets no useful signal when training our neural network.
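As a minimal sketch, the step function and its flat, uninformative gradient look like this in Python:

```python
def step(x):
    # Heaviside step: fires (1) once the input reaches the threshold 0.
    return 1 if x >= 0 else 0

def step_grad(x):
    # The derivative is 0 everywhere it exists, so it carries no
    # learning signal for gradient descent.
    return 0

print(step(2.5))    # 1
print(step(-0.1))   # 0
print(step_grad(5)) # 0
```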
A more common activation function used in a neural network is the sigmoid function.
s(x) = 1 / (1 + e^(-x))
A sigmoid function converts independent variables of near infinite range into simple probabilities between 0 and 1.
It is more popular than the step function because it is differentiable and continuous.
However, the two main problems with the sigmoid function are:
- its outputs are not zero-centered, and
- it suffers from the vanishing gradient problem, where saturated neurons essentially kill the gradient.
Because of this vanishing gradient, the convergence is very slow, and it is challenging to train deep networks.
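To make the vanishing gradient concrete, here is a small sketch of the sigmoid and its derivative s'(x) = s(x)(1 − s(x)), which peaks at only 0.25 and collapses for large inputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative s'(x) = s(x) * (1 - s(x)); its maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0))        # 0.5
print(sigmoid_grad(0))   # 0.25
print(sigmoid_grad(10))  # tiny: a saturated neuron's gradient nearly vanishes
```

Note that the outputs are always positive, which is exactly the "not zero-centered" issue above.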
Tanh is a hyperbolic trigonometric function. Unlike the sigmoid function, it is zero-centered, since it squashes the output between -1 and 1.
However, it also suffers from the vanishing gradient problem.
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
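A quick sketch of the formula above, showing both the zero-centered output and the saturation that still causes vanishing gradients:

```python
import math

def tanh(z):
    # tanh(z) = (e^z - e^-z) / (e^z + e^-z)
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

print(tanh(0))   # 0.0 -- zero-centered, unlike sigmoid
print(tanh(3))   # close to 1: the function saturates for large inputs
print(tanh(-3))  # close to -1: saturates on the negative side too
```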
Rectified Linear Unit(ReLU):
This is the most popular activation function for deep neural networks: it is extremely computationally efficient, and it avoids the vanishing gradient problem for positive inputs, since the gradient of ReLU is either zero or a constant.
ReLU activation functions have been shown to train better in practice than sigmoid or tanh activation functions.
R(x) = max(0, x)
However, it suffers from a problem called dying ReLU.
A dead ReLU always outputs the same value, usually zero, and because its gradient is zero there, it stops receiving backpropagation updates.
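A minimal sketch of R(x) = max(0, x) and its gradient, illustrating why a neuron stuck on the negative side goes dead:

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Gradient is a constant 1 for positive inputs, and 0 otherwise --
    # a neuron that only sees negative inputs gets no updates at all.
    return 1.0 if x > 0 else 0.0

print(relu(4.2))      # 4.2
print(relu(-1.3))     # 0.0
print(relu_grad(5))   # 1.0
print(relu_grad(-5))  # 0.0 -- the "dead" case
```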
Leaky ReLU is a strategy to mitigate the “dying ReLU” issue. Instead of the function being zero when x < 0, leaky ReLU has a small non-zero slope (e.g., around 0.01) for negative inputs:
f(x) = x if x >= 0
f(x) = 0.01x otherwise
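As a sketch, with the slope exposed as a parameter (0.01 here, matching the value mentioned above):

```python
def leaky_relu(x, alpha=0.01):
    # A small slope alpha keeps a non-zero gradient for negative inputs,
    # so the neuron can still recover during training.
    return x if x >= 0 else alpha * x

print(leaky_relu(3.0))   # 3.0, same as plain ReLU for positive inputs
print(leaky_relu(-2.0))  # -0.02, small but non-zero
```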
Like the sigmoid activation function, softmax also squashes each output into the range between 0 and 1.
We use softmax in the output layer of a multi-class classification problem where it gives the probability that an input belongs to a particular class.
If we add all the outputs of a softmax function, they always sum to 1.
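A minimal sketch of softmax over a list of raw scores (the max-subtraction is a standard trick for numerical stability, not part of the definition):

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score gets the largest probability
print(sum(probs))  # 1.0 (up to floating-point rounding)
```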
Which activation functions should you use?
In this post, I have discussed some of the commonly used activation functions.
We use activation functions to introduce non-linearity into the neural network. Now the question is which one should you use?
In hidden layers, you can use ReLU in almost all situations. If you find dead activations in your network, then you can try a variant of ReLU such as Leaky ReLU.
For the output layer of a multi-class classification network, we can use softmax as our activation function.