
Vanishing Gradient Problem in Deep Learning: Explained

Imagine a scenario where you want to train a deep learning model to distinguish between cats and dogs. Initially, things seem promising: your model trains well and learns to identify basic shapes. But as soon as you stack a few more hidden layers to capture complex features, something odd happens: the network becomes harder and harder to train. The accuracy curve drops, and no matter how long you train or how carefully you tune the parameters, the model just does not improve.

A fundamental issue in deep learning often causes this frustrating scenario: the vanishing gradient problem. As backpropagation updates the weights through multiple layers, the gradients often shrink exponentially, becoming so small that the weights in the early layers barely change. This stalls the entire training process and prevents the model from reaching the global minimum.

In this article, we’ll explore what the vanishing gradient problem is, why it happens, and how modern architectures and activation functions are designed to mitigate it.

Prerequisites

Before diving into the vanishing gradient problem, it is important to have a solid understanding of the following foundational concepts in deep learning and calculus:

  • Forward and Backward Propagation: You should know how data moves through a neural network during the forward pass to make predictions, and how gradients are calculated and propagated backward to update weights during training.
  • Gradient Descent: Familiarity with how optimization algorithms like gradient descent minimize a loss function by iteratively adjusting model parameters is essential.
  • Global and Local Minima: Understanding the concepts of global minimum (the lowest possible value of a loss function) and local minima (points that are lower than their neighbors but not necessarily the lowest overall) helps contextualize optimization challenges.
  • Partial Derivatives and Differentiation: Since gradients are computed using partial derivatives of the loss function with respect to each weight, a basic grasp of calculus and how to differentiate functions is necessary.
  • Chain Rule of Differentiation: Backpropagation heavily relies on the chain rule to compute gradients across layers, making this rule central to understanding how gradients can vanish.

Before tackling the vanishing gradient problem itself, we will first review a few basic concepts.

Forward Propagation

Forward propagation is the process by which input data passes through a neural network to generate predictions. Let us assume a simple neural network:

  • 1 input layer with inputs x1, x2.
  • 1 hidden layer with weights W1, biases b1, and activation function σ.
  • 1 output layer with weights W2, biases b2.
  • Output ŷ and true label y.
  • Loss function (MSE): L = ½(y − ŷ)².

Here, the activation function we are talking about is specifically the sigmoid activation function, defined as σ(x) = 1 / (1 + e^(−x)). The sigmoid is a mathematical function used in neural networks to introduce non-linearity into the model. It takes any real-valued number and maps it to a value between 0 and 1, making it especially useful for models that need to output probabilities.
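As a quick, self-contained sketch (assuming NumPy; the sample inputs are arbitrary), here is the sigmoid squashing a range of real values into (0, 1):

import numpy as np

def sigmoid(x):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# -> approximately [0.000045, 0.269, 0.5, 0.731, 0.999955]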

During forward propagation, here is the calculation that takes place in the hidden layer:

z1 = W1·x + b1
a1 = σ(z1)

Output layer calculations:

z2 = W2·a1 + b2
ŷ = z2

After the model generates an output (denoted as ŷ), this predicted value is compared to the actual target value using a loss function, which calculates how far off the prediction is. The resulting loss quantifies the model’s error. This loss is then passed to an optimizer, which adjusts the model’s weights in a way that aims to reduce the loss in future iterations, ultimately helping the model learn better predictions.

The optimizer reduces the loss by updating the weights during backpropagation.

Backpropagation

Backpropagation (short for “backward propagation of errors”) is the key algorithm used to train neural networks. It allows the model to learn from its mistakes by adjusting its internal parameters (weights and biases) based on how wrong its predictions were.

  • Forward Pass: An input is passed through the network layer by layer to compute the prediction ŷ.
  • Loss Calculation: The predicted output ŷ is compared with the true label y using a loss function (e.g., MSE, cross-entropy) to calculate the error.
  • Backward Pass (Backpropagation): This is where the magic happens. The algorithm computes the gradient of the loss with respect to each weight in the network using the chain rule from calculus. These gradients show how much a small change in each weight will affect the loss.
  • Weight Updates: The optimizer (e.g., SGD, Adam) uses these gradients to update the weights:

w ← w − η · (∂L/∂w)

Where:

  • w is the weight,
  • η is the learning rate,
  • ∂L/∂w is the gradient of the loss with respect to the weight.

During backpropagation, we update the weights in a way that reduces the loss function and moves us toward the global minimum. This process makes deep learning possible by efficiently computing gradients for millions of parameters using shared intermediate computations.

Python Code Example (Simple NN with Backprop)

import numpy as np

# Activation and its derivative
def relu(x):
    return np.maximum(0, x)

def relu_deriv(x):
    return (x > 0).astype(float)

# Initialize data and weights
x = np.array([[0.5], [0.3]])  # Input, shape (2, 1)
y = np.array([[1.0]])         # Target, shape (1, 1)
W1 = np.random.randn(3, 2)
b1 = np.zeros((3, 1))
W2 = np.random.randn(1, 3)
b2 = np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1
a1 = relu(z1)
z2 = W2 @ a1 + b2
y_pred = z2
loss = 0.5 * np.square(y - y_pred)  # MSE, matching L = 0.5 * (y - y_hat)^2

# Backpropagation (chain rule, layer by layer)
dL_dy = y_pred - y
dL_dW2 = dL_dy @ a1.T
dL_db2 = dL_dy
dL_da1 = W2.T @ dL_dy
dL_dz1 = dL_da1 * relu_deriv(z1)
dL_dW1 = dL_dz1 @ x.T
dL_db1 = dL_dz1
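After this single forward and backward pass, dL_dW1, dL_db1, dL_dW2, and dL_db2 hold the gradients an optimizer would use in the update step, e.g., W1 -= eta * dL_dW1 for plain gradient descent (here eta stands for a learning rate you would choose).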

Now, during backpropagation, we compute the gradient of the loss function with respect to each weight, ∂L/∂w, in the network using the chain rule. Let’s look at the math behind it for one layer. Let:

  • x be the input feature vector,
  • W be the weight matrix,
  • b be the bias,
  • z = Wx + b,
  • α = σ(z) be the activation (sigmoid in this case),
  • ŷ = α (for simplicity in this 1-layer example),
  • L = ½(y − ŷ)² (Mean Squared Error).

Then the gradient of the loss with respect to the weights is:

∂L/∂W = (∂L/∂ŷ) · (∂ŷ/∂z) · (∂z/∂W) = (ŷ − y) · σ′(z) · xᵀ

This equation represents the chain rule to calculate gradients layer by layer.
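As a minimal sketch of this chain rule in action (scalar values chosen to match the worked example further below; the finite-difference check is only for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One-layer scalar example: z = w*x + b, y_hat = sigmoid(z), L = 0.5*(y - y_hat)**2
x, w, b, y = 1.0, 5.0, 0.0, 1.0
z = w * x + b
y_hat = sigmoid(z)

# Chain rule: dL/dw = (y_hat - y) * sigmoid'(z) * x
analytic = (y_hat - y) * y_hat * (1.0 - y_hat) * x

# Numerical check via central finite differences
eps = 1e-6
loss = lambda w_: 0.5 * (y - sigmoid(w_ * x + b)) ** 2
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(analytic, numeric)  # both approximately -4.4e-05: already tiny for one layer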

Now, the point to be noted here is that we are using a sigmoid activation function in each layer.


As we learned earlier, the sigmoid activation function outputs a value between 0 and 1. Its derivative, however, always lies in the interval (0, 0.25].

The derivative of the sigmoid function is given by σ′(x) = σ(x)(1 − σ(x)), where σ(x) is the output of the sigmoid function and always lies in the open interval (0, 1). This expression is a product of two positive numbers, σ(x) and 1 − σ(x). The product is maximized when both factors are equal, i.e., when σ(x) = 0.5; at this point the derivative reaches its peak value of 0.5 · (1 − 0.5) = 0.25. For any other value of σ(x), closer to either 0 or 1, the product becomes smaller. Therefore, the derivative of the sigmoid function is always positive and at most 0.25.
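A quick numerical check of this bound (a minimal NumPy sketch):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-10, 10, 10001)
deriv = sigmoid(xs) * (1 - sigmoid(xs))

print(deriv.max())          # 0.25, attained at x = 0
print(deriv[0], deriv[-1])  # ~4.5e-05 at the tails: practically zero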

Now, let us recap here:

During backpropagation, the derivative of the loss function with respect to a weight w is computed with the chain rule:

∂L/∂w = (∂L/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)

where the middle factor is:

∂ŷ/∂z = σ′(z) = σ(z)(1 − σ(z))

The problem lies in σ′(z), which, as we have seen, is always ≤ 0.25 and can get very close to 0 when z is very large or very small.
For example, take:

  • x = 1.0,
  • w = 5.0,
  • b = 0,
  • so z = w·x + b = 5.0,
  • σ(5) ≈ 0.993,
  • σ′(5) = σ(5)·(1 − σ(5)) ≈ 0.993 · (1 − 0.993) ≈ 0.0069.

Now, during backpropagation, the gradient of the loss with respect to this weight picks up σ′(z) as a factor. Let’s assume the upstream error term is 0.5, so:

∂L/∂w ≈ error · σ′(z) · x ≈ 0.5 · 0.0069 · 1.0 ≈ 0.0035

That is a very small gradient. If many such small factors are multiplied across layers during backpropagation, the resulting gradient becomes exponentially smaller.

If this happens in each layer of a deep network, the gradients keep getting multiplied by values like 0.003 and quickly become negligible by the time they reach the earlier layers. As a result, those early layers learn very slowly or not at all. This is the vanishing gradient problem.
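A back-of-the-envelope illustration: even in the best case, where every sigmoid derivative takes its maximum value of 0.25, the product across layers decays exponentially with depth (ignoring the weights entirely for simplicity):

# Upper bound on the product of sigmoid derivatives across n layers
for n in [1, 5, 10, 20, 30]:
    print(n, 0.25 ** n)
# 1  0.25
# 5  0.0009765625
# 10 9.5367431640625e-07
# 20 9.094947017729282e-13
# 30 8.673617379884035e-19

The PyTorch snippet below makes the same point empirically: it stacks 30 Linear + Sigmoid blocks and plots the gradient that survives all the way back to the input, which is typically vanishingly small.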

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Simple deep network: `depth` blocks of Linear + Sigmoid
class DeepNet(nn.Module):
    def __init__(self, depth):
        super().__init__()
        layers = []
        for _ in range(depth):  # fresh layers each time, so weights are not shared
            layers += [nn.Linear(100, 100), nn.Sigmoid()]
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

# Input and model
x = torch.randn(100, requires_grad=True)
model = DeepNet(depth=30)
y = model(x)
y.sum().backward()

# Plot the gradient that reaches the input after 30 layers
plt.plot(x.grad.detach().numpy())
plt.title("Gradient Flow After 30 Layers")
plt.xlabel("Input Index")
plt.ylabel("Gradient Value")
plt.grid(True)
plt.show()

So far, we have seen how the sigmoid activation function leads to repeated multiplication of small derivatives, causing the vanishing gradient problem and stalling the learning process for the early layers of deep architectures.

ReLU Activation

Now, to solve this problem, researchers introduced an activation function known as ReLU, along with its variants. Here is the concept of ReLU (Rectified Linear Unit):

f(x) = max(0, x)

This means:

  • If x > 0, then f(x) = x
  • If x ≤ 0, then f(x) = 0

How does ReLU solve the vanishing gradient?

The ReLU (Rectified Linear Unit) activation function is widely used in deep learning due to its simplicity and effectiveness. It is defined as f(x) = max(0, x), meaning it outputs the input directly if it is positive, and zero otherwise. This simple function helps mitigate the vanishing gradient problem that often occurs with sigmoid or even tanh functions. ReLU allows gradients to flow through the network without being squashed, as its derivative is 1 for positive inputs, which ensures that learning continues effectively in deeper layers.

However, for inputs less than or equal to zero, the derivative is 0, which can lead to what is known as the “dying ReLU” problem, where neurons stop updating entirely if they consistently receive non-positive inputs. Despite this, ReLU is computationally efficient and introduces non-linearity, enabling networks to model complex functions. To address its limitations, variants like Leaky ReLU and Parametric ReLU introduce small gradients for negative inputs, helping keep neurons active throughout training.
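To see the contrast empirically, here is a minimal sketch (mirroring the 30-layer PyTorch experiment above; exact magnitudes depend on the random initialization) comparing the input gradients under Sigmoid and ReLU:

import torch
import torch.nn as nn

def make_net(act, depth=30, width=100):
    # Stack `depth` blocks of Linear followed by the given activation
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), act()]
    return nn.Sequential(*layers)

torch.manual_seed(0)
for act in [nn.Sigmoid, nn.ReLU]:
    x = torch.randn(100, requires_grad=True)
    net = make_net(act)
    net(x).sum().backward()
    print(act.__name__, x.grad.abs().mean().item())
# The sigmoid network's input gradients are typically many orders of
# magnitude smaller than the ReLU network's.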

This is one effective approach to addressing the vanishing gradient problem, though there are several other methods as well, which we will explore in detail in our upcoming articles.

FAQs

What is the vanishing gradient problem?
It’s when gradients become too small to update earlier layers during backpropagation, slowing or halting learning.

Which activation functions cause vanishing gradients?
Sigmoid and tanh.

How do you fix the vanishing gradient problem?
Use ReLU, batch normalization, residual connections, and proper initialization.

What’s the difference between exploding and vanishing gradients?
Vanishing gradients shrink toward zero; exploding gradients grow excessively.

Why is vanishing gradient a problem in RNNs?
It prevents learning long-term dependencies in sequences.

Conclusion

The vanishing gradient problem is a fundamental challenge in training deep neural networks, particularly when using activation functions like sigmoid or tanh. As gradients shrink layer by layer during backpropagation, early layers struggle to learn effectively, slowing down or even stopping the training process. Understanding how this issue arises, especially through the role of activation functions and derivatives, is the first step toward designing better architectures. Fortunately, techniques like using ReLU and its variants, implementing batch normalization, adopting residual connections, and applying smart weight initialization strategies have significantly improved the trainability of deep models. By incorporating these solutions, we can ensure more stable and efficient learning, even in very deep networks. As deep learning continues to evolve, mastering these foundational issues becomes essential for building robust and high-performing models.
