Activation Functions

An activation function transforms the weighted sum of inputs arriving at a node in a neural network into the node's output.


It helps the model decide whether a neuron should be activated and adds non-linearity to the neuron's output, which enables the network to learn complex patterns.


Uses of activation functions in a neural network:

  • Produce predictions when used in the output layer
  • Convert linear mappings into non-linear mappings when applied after hidden layers
  • Keep gradient values within limits, helping prevent issues like exploding and vanishing gradients
  • Improve learning and generalization
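To make the first two points concrete, here is a minimal plain-Python sketch of a single node: a weighted sum followed by a non-linear activation (ReLU is used here for illustration; the function name `node_output` is ours, not from any library):

```python
def node_output(weights, inputs, bias):
    """Weighted sum of inputs passed through a ReLU activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # linear combination
    return max(0.0, z)                                      # non-linear activation

# A node with two inputs: the negative pre-activation (-1.5) is clipped to zero.
print(node_output([0.5, -1.0], [2.0, 3.0], 0.5))  # 0.0
```

Without the activation step, stacking such nodes would only ever compute another linear function of the inputs.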

Some popular activation functions are given below.

Sigmoid or Logistic 

Sigmoid is a nonlinear, monotonic function with an S-shaped curve. Its output range is (0, 1), so each output can be read as an independent probability. Thus, sigmoid is used for binary classification. It is also used for multi-label classification, i.e. when the outputs are not mutually exclusive. For example, we use sigmoid in the output layer of a model that classifies diseases in a chest x-ray image: the image might contain an infection, emphysema, and/or cancer, or none of those findings.

Mathematically it can be represented as:

sigmoid(x) = 1 / (1 + e^(-x))


Pros:

  • Can be used to obtain the output as a probability

Cons:

  • Suffers from the vanishing gradient problem
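The formula above can be sketched directly in plain Python (the function name `sigmoid` is ours, not from a particular library):

```python
import math

def sigmoid(x):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))   # 0.5 -- the midpoint of the S-curve
print(sigmoid(4.0))   # close to 1 for large positive inputs
```

Note that the slope flattens out for large |x|, which is exactly where the vanishing-gradient problem comes from.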


ReLU

ReLU is an abbreviation for Rectified Linear Unit. It is a piecewise linear function and one of the most commonly used activation functions in deep neural networks. Neurons are deactivated (output zero) when their input is negative. Its range is [0, inf).

The curve is also known as the ramp function, and it is similar to half-wave rectification in electrical engineering (i.e. only the positive part of the input is passed).

Mathematically it can be represented as:

ReLU(x) = max(0, x)


Pros:

  • Prevents the vanishing gradient problem
  • Less computationally expensive
  • Accelerates the convergence of gradient descent

Cons:

  • Problem of dead neurons
  • Causes a positive bias shift

Dead neurons are neurons that stop learning: once a neuron's input is always negative, its output is always zero, so no gradient flows back through it during backpropagation and its weights are never updated.
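A quick sketch of ReLU and its gradient in plain Python (function names are ours) makes the dead-neuron issue visible:

```python
def relu(x):
    """max(0, x): passes positive inputs, zeroes out negatives."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: zero for negative inputs."""
    return 1.0 if x > 0 else 0.0

print([relu(v) for v in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 3.0]
print(relu_grad(-2.0))                      # 0.0 -- no weight update flows through
```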


Tanh

Tanh is very similar to the sigmoid function, but better in some respects. Its output lies in the range (-1, 1), and its S-shaped curve is centered at zero. It is used in classifiers, but its saturating regions can slow learning over a large number of epochs.

Mathematically it can be represented as:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))


Pros:

  • Output is zero-centered, unlike sigmoid

Cons:

  • Suffers from the vanishing gradient problem
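The formula above, written out in plain Python (the stdlib also provides `math.tanh` directly):

```python
import math

def tanh(x):
    """Zero-centered S-curve with range (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(tanh(0.0))   # 0.0 -- zero-centered, unlike sigmoid(0) = 0.5
print(tanh(2.0))   # approaches 1 for large positive inputs
```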

Leaky ReLU

The leaky ReLU is a variant of ReLU that uses a small, fixed slope for negative inputs instead of zero. This keeps some gradient flowing for negative values, which can make learning faster and more balanced than plain ReLU. It is used when you want neurons to respond to negative input values.

Mathematically it can be represented as:

LeakyReLU(x) = x if x > 0, else 0.01x


Pros:

  • Useful when gradients are sparse

Cons:

  • May cause dead neurons
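A minimal sketch in plain Python, using the common default slope of 0.01 (the parameter name `negative_slope` follows a common convention and is our choice here):

```python
def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but lets a small signal through for negative inputs."""
    return x if x > 0 else negative_slope * x

print(leaky_relu(3.0))    # 3.0 -- identical to ReLU for positive inputs
print(leaky_relu(-5.0))   # small negative value instead of a hard zero
```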

Parameterized ReLU

Parameterized ReLU (PReLU) is a version of ReLU in which the slope of the negative part is an argument 'a' that is learned during training. It is used when you want neurons to be activated for negative input values and want the slope for negative values to be learned rather than fixed.

Mathematically it can be represented as:

PReLU(x) = x if x > 0, else a·x


Pros:

  • Better at solving the dead neuron problem than ReLU and Leaky ReLU

Cons:

  • Variation of the argument 'a' may affect learning of the model
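A sketch in plain Python; here 'a' is passed in explicitly, whereas in a real framework it would be a trainable parameter updated by gradient descent:

```python
def prelu(x, a):
    """Parameterized ReLU: the negative slope `a` is learned in practice."""
    return x if x > 0 else a * x

print(prelu(2.0, 0.25))    # 2.0 -- positive inputs pass through unchanged
print(prelu(-4.0, 0.25))   # -1.0 -- negative inputs scaled by a
```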

Exponential Linear Units (ELUs)

ELU is similar to ReLU except that the negative part of the function is smoothed using an exponential. It tends to drive the cost (loss) toward zero faster than ReLU.

Mathematically it can be represented as:

ELU(x) = x if x > 0, else alpha · (e^x - 1)


Pros:

  • Converges better than ReLU, Leaky ReLU, and Parameterized ReLU
  • Has a smoother negative curve


Swish

The swish activation function multiplies the input by its sigmoid, so it costs roughly as much to compute as sigmoid, and it performs better on deeper models. Its curve for negative inputs is smooth and non-monotonic, unlike ReLU's hard zero. It tends to work better than ReLU for deep neural networks.

Mathematically it can be represented as:

swish(x) = x · sigmoid(x) = x / (1 + e^(-x))
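A plain-Python sketch of the formula above (the optional `beta` scaling follows the generalized form of swish; with beta = 1 it reduces to the equation shown):

```python
import math

def swish(x, beta=1.0):
    """x * sigmoid(beta * x): smooth and non-monotonic near zero."""
    return x / (1.0 + math.exp(-beta * x))

print(swish(0.0))    # 0.0
print(swish(-1.0))   # small negative value -- the curve dips below zero
```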


Softmax

The softmax function outputs a vector of values that sum to 1 and can be interpreted as probabilities of class membership. It is an extension of the sigmoid function to multi-class classification. Softmax is commonly used as the activation function for the last layer. For example, we can use softmax in the last layer of a model that classifies cars by manufacturer: each car belongs to exactly one manufacturer.

Mathematically it can be represented as:

softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
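The formula above in plain Python; subtracting the maximum score before exponentiating is a standard trick (not part of the formula itself) that avoids overflow without changing the result:

```python
import math

def softmax(xs):
    """Exponentiate and normalize so the outputs sum to 1."""
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # largest score gets the largest probability
print(sum(probs))   # the outputs always sum to 1
```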

Which activation function should I use for my neural network?

There is no general formula for choosing an activation function. There are many considerations, but the usual rule of thumb is to start with the most recommended option and move on to others if it doesn't give the desired results.

  • Softmax is used for multi-class classification in the output layer
  • ReLU is the default choice for hidden layers and generally works best there
  • ReLU is used in the hidden layers of CNNs; tanh and sigmoid are used in the hidden layers of RNNs
  • Sigmoid functions and their combinations generally work better for binary and multi-label classification problems in the output layer
  • Sigmoid and tanh are avoided in hidden layers due to the vanishing gradient problem
  • Swish is reported to help in very deep networks (e.g. more than 40 layers)
  • Plain ReLU can suffer from the dead neuron problem; Leaky ReLU, PReLU, or ELU are alternatives
