
Introduction to activation functions

Why do we need activation functions?

We use activation functions to introduce non-linearity into the neural network. Some of them, such as the sigmoid, also squash inputs of nearly unlimited range into values between 0 and 1 that can be read as probabilities. Now the question is, why do we need non-linearities?

Let’s take a linear equation, y = x. If we plot it, we get a straight line.

Linear functions always form a straight line. Even if you add more dimensions, the result stays flat (a plane or hyperplane); it never curves.

Let’s take a nonlinear equation, y = x^2. If we plot it, we get a curve (a parabola) instead of a straight line.


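Neural networks are “universal approximators”: with enough hidden units and a non-linear activation function, they can approximate a very wide class of functions, for example any continuous function on a bounded domain.

However, without a non-linear activation function, the output of the network is always a linear function of its input, no matter how many layers it has.

Remember, a neural network without an activation function is just a linear model, much like linear regression.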
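To make this concrete, here is a minimal sketch in plain Python. The weights W1 and W2 and the helper names are hypothetical, and biases are left out for brevity. It suggests that two stacked layers without an activation give the same output as a single layer whose weight matrix is the product of the two.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# Hypothetical weights for two layers with no activation function.
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # 2 inputs -> 3 hidden units
W2 = [[2.0, -1.0, 0.5]]                     # 3 hidden units -> 1 output

x = [4.0, -2.0]

# Pass x through both layers, one after the other.
two_layers = matvec(W2, matvec(W1, x))

# Collapse both layers into one weight matrix and apply it once.
W_combined = matmul(W2, W1)
one_layer = matvec(W_combined, x)

print(two_layers, one_layer)  # both lists hold the same value
```

Adding biases does not change the conclusion: the combined map is still a single weight matrix plus a single bias vector.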

Another important property of an activation function is that it should be differentiable, at least almost everywhere.

This is because we use backpropagation to calculate the gradient of the error function with respect to the weights, and backpropagation needs the derivative of the activation function.

Now let’s see some of the commonly used activation functions.

Step Function:

The simplest activation function is the step function, which is used by the perceptron model. It is a threshold-based activation function.

step activation function

$$
f(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}
$$

As we can see from the equation above, this is a straightforward threshold function.

If the input value is greater than or equal to a particular threshold (here 0), it returns 1; otherwise it returns 0.

As we have seen earlier, the activation function should be differentiable, but the step function is not differentiable at the threshold, and its derivative is 0 everywhere else. A zero gradient gives gradient descent no signal for updating the weights, which leads to problems when training our neural network.
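As a rough illustration, the sketch below implements a step function with a threshold of 0 and estimates its slope with a central difference. The function names and the sample points are assumptions chosen for this example.

```python
def step(x, threshold=0.0):
    """Return 1.0 when x reaches the threshold, otherwise 0.0."""
    return 1.0 if x >= threshold else 0.0


def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Sample points on both sides of the threshold, but not on it.
points = [-2.0, -0.5, 0.5, 2.0]
slopes = [numerical_derivative(step, x) for x in points]

print([step(x) for x in points])
print(slopes)  # the estimated slope is zero at every sample point
```

Because the slope is zero on both sides of the threshold, a gradient-based update would leave the weights unchanged.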

Sigmoid Function:

A more common activation function used in a neural network is the sigmoid function.

sigmoid activation function

$$
s(x) = \frac{1}{1 + e^{-x}}
$$

A sigmoid function squashes any real-valued input, however large or small, into the range between 0 and 1, so its output can be read as a probability.

It is more popular than the step function because it is differentiable and continuous.

However, the two main problems with the sigmoid function are:

  • The outputs are not zero-centered.
  • It suffers from the vanishing gradient problem, where saturated neurons essentially kill the gradient.

Because of this vanishing gradient, the convergence is very slow, and it is challenging to train deep networks.
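The saturation effect can be sketched with the sigmoid and its derivative, s(x)(1 - s(x)). The code below is an illustrative sketch rather than a reference implementation; the function names and the choice of 10 layers are assumptions.

```python
import math


def sigmoid(x):
    """Sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))

# Even in the best case (derivative 0.25 at every layer), the product of
# the activation derivatives across 10 layers is already tiny.
best_case_product = 0.25 ** 10
print(best_case_product)
```

Note also that every output is positive, which is what "not zero-centered" refers to.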

Tanh Function:

Tanh is the hyperbolic tangent function. Unlike the sigmoid function, it is zero-centered, since it squashes the output into the range between -1 and 1.

However, it also suffers from the vanishing gradient problem.

tanh activation function

$$
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
$$
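The following sketch evaluates the formula directly and compares it with Python's built-in math.tanh. The helper names are assumptions for illustration, and the direct formula is only suitable for moderate values of z, since math.exp overflows for very large inputs.

```python
import math


def tanh_from_formula(z):
    """tanh computed from its exponential definition (moderate z only)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, tanh_from_formula(z), math.tanh(z), tanh_grad(z))
```

The outputs are symmetric around zero, and the derivative falls toward zero for large |z|, which is the same saturation behavior seen with the sigmoid.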

Rectified Linear Unit (ReLU):

ReLU is the most popular activation function for deep neural networks because it is extremely computationally efficient and it reduces the vanishing gradient problem: the gradient of a ReLU is either zero or a constant (1 for positive inputs), so it does not shrink as the input grows.

ReLU activation functions have been shown to train better in practice than sigmoid or tanh activation functions.

relu activation function

$$
R(x) = \max(0, x)
$$

However, it suffers from a problem called dead ReLU.

A dead ReLU outputs the same value, usually zero, for every input. Because its gradient is then also zero, backpropagation stops updating its weights and the unit rarely recovers.
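Here is a small sketch of ReLU, its gradient, and a unit that behaves as dead on a sample of inputs. The weight, bias, and inputs are hypothetical values picked so that the pre-activation is always negative.

```python
def relu(x):
    """ReLU: pass positive inputs through, clamp the rest to zero."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 for negative inputs.

    The derivative is undefined at exactly x == 0; returning 0 there
    is a common convention and an assumption of this sketch.
    """
    return 1.0 if x > 0 else 0.0


# Hypothetical unit with a large negative bias.
weight, bias = 0.5, -10.0
inputs = [-3.0, -1.0, 0.0, 2.0, 4.0]

pre_activations = [weight * x + bias for x in inputs]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(outputs)  # every output is zero
print(grads)    # every gradient is zero, so no update reaches the weights
```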

Leaky ReLU:

Leaky ReLU is a strategy to mitigate the “dying ReLU” issue. Instead of being zero when x < 0, the leaky ReLU has a small non-zero slope there (e.g., around 0.01), so the gradient never becomes exactly zero.

leaky relu activation function

$$
f(x) = \begin{cases} x & \text{if } x \ge 0 \\ 0.01x & \text{otherwise} \end{cases}
$$
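A minimal sketch of leaky ReLU and its gradient follows. The parameter name alpha and its default of 0.01 are assumptions that match the example slope mentioned above.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for x >= 0, a small slope alpha otherwise."""
    return x if x >= 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient of leaky ReLU: 1 for x >= 0, alpha otherwise."""
    return 1.0 if x >= 0 else alpha


for x in (-5.0, -1.0, 0.0, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Unlike plain ReLU, the gradient for negative inputs is small but not zero, so a unit stuck in the negative region can still receive weight updates.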


Softmax Function:

Like the sigmoid activation function, softmax also squashes each output into the range between 0 and 1.

We use softmax in the output layer of a multi-class classification problem where it gives the probability that an input belongs to a particular class.

If we add up all the outputs of a softmax function, the sum is always 1.
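As a sketch, softmax can be computed by exponentiating each score and dividing by the sum of the exponentials. The version below subtracts the maximum score first, a common trick for numerical stability that does not change the result. The function name and the sample scores are assumptions.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])

print(probs)
print(sum(probs))  # equal to 1 up to floating-point rounding
```

The class with the highest raw score receives the highest probability, and the ordering of the scores is preserved.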

Which activation functions should you use?

In this post, I have discussed some of the commonly used activation functions.

We use activation functions to introduce non-linearity into the neural network. Now the question is: which one should you use?

In hidden layers, ReLU is a good default in almost all situations. If you find dead activations in your network, you can try a variant of ReLU such as Leaky ReLU.

For the output layer of a multi-class classification problem, we can use softmax as our activation function.
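Putting these recommendations together, the sketch below runs a forward pass through a tiny network with a ReLU hidden layer and a softmax output layer. All weights, biases, layer sizes, and helper names are hypothetical and chosen only for illustration; a real network would learn its weights through training.

```python
import math


def relu(x):
    return max(0.0, x)


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(W, b, x):
    """Linear layer: one weighted sum plus bias per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(x, W1, b1, W2, b2):
    """ReLU in the hidden layer, softmax in the output layer."""
    hidden = [relu(z) for z in dense(W1, b1, x)]
    return softmax(dense(W2, b2, hidden))


# Hypothetical parameters: 2 inputs -> 3 hidden units -> 3 classes.
W1 = [[0.5, -0.2], [-0.3, 0.8], [0.1, 0.1]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5], [-0.5, 1.0, 0.0], [0.2, 0.2, 0.2]]
b2 = [0.0, 0.0, 0.0]

probs = forward([1.0, 2.0], W1, b1, W2, b2)
print(probs)  # one probability per class
```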
