From the course: Deep Learning with Python: Optimizing Deep Learning Models
Gradient clipping - Python Tutorial
- [Instructor] Gradient clipping is a technique used in deep learning to prevent the gradients of a model from becoming excessively large during training. This phenomenon, known as exploding gradients, occurs when the gradients of the loss function with respect to the model's parameters grow excessively large during backpropagation. This can destabilize training, cause numerical issues, and prevent the model from converging to an optimal solution. Gradient clipping addresses this problem by putting a limit, or cap, on how large the gradients can get. There are two primary approaches to gradient clipping: clipping by value and clipping by norm.

In clipping by value, each individual gradient is clipped so that it stays within a specified minimum and maximum value. Let's walk through a simple example of how this works. Suppose we have a gradient vector G with values 2, -6, 8, -3, and 5, and a gradient threshold C set to 4. This means that the minimum gradient value allowed is -4 and the maximum is 4. For each gradient g_i in G, if the value is greater than 4, it is clipped to 4, and if the value is less than -4, it is clipped to -4. Therefore, with a threshold of 4, clipping the gradient vector by value results in the vector with values 2, -4, 4, -3, and 4.

Alternatively, instead of clipping individual gradients, clipping by norm constrains the overall size, or norm, of the gradient vector to a maximum threshold C. If the total norm exceeds C, all gradients in the vector are scaled down proportionally so that the norm equals C. The L2 norm of a gradient vector G is the square root of the sum of the squared gradients, where g_i represents each of the n gradients in the vector. If the L2 norm of the gradient vector exceeds the maximum threshold C, the vector is scaled by a factor of C divided by the L2 norm. Let's walk through a simple example of how this works. Suppose we have a gradient vector G with values 2, -6, 8, -3, and 5, and a gradient threshold C set to 6. The L2 norm of the gradient vector is approximately 11.75. Since 11.75 is greater than the threshold of 6, the vector needs to be rescaled. With a threshold value of 6 and an L2 norm of 11.75, the scaling factor is approximately 0.51. Each gradient in the vector is multiplied by the scaling factor, resulting in a vector with values 1.02, -3.06, 4.08, -1.53, and 2.55.

At a high level, you choose clipping by value if you need a straightforward approach to limit gradients and don't want to compute norms, or when you need to limit the magnitude of individual gradient components to avoid drastic updates to specific parameters. Clipping by norm, on the other hand, is better suited for situations where it is critical to maintain the relative direction of the gradient vector while controlling its overall magnitude, ensuring consistent updates across all parameters. In summary, the choice between clipping by value and clipping by norm depends on the specific characteristics of your model, your training setup, and the type of instability you are encountering.
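Both strategies can be reproduced in a few lines of Python. The sketch below is not taken from the course; it is a minimal illustration using NumPy that reproduces the two worked examples above, with illustrative names such as clip_value and max_norm.

import numpy as np

# Gradient vector from the worked examples
g = np.array([2.0, -6.0, 8.0, -3.0, 5.0])

# Clipping by value: cap each component to the range [-4, 4]
clip_value = 4.0
print(np.clip(g, -clip_value, clip_value))   # [ 2. -4.  4. -3.  4.]

# Clipping by norm: rescale the whole vector if its L2 norm exceeds 6
max_norm = 6.0
l2_norm = np.linalg.norm(g)                  # sqrt(138), roughly 11.75
if l2_norm > max_norm:
    g = g * (max_norm / l2_norm)             # scaling factor roughly 0.51
print(g)  # matches the worked example above, up to rounding

Most deep learning frameworks expose the same behavior directly: for example, Keras optimizers accept clipvalue and clipnorm arguments, and PyTorch provides torch.nn.utils.clip_grad_value_ and torch.nn.utils.clip_grad_norm_.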
Gradient clipping offers several benefits that make it a valuable tool in deep learning. One of its primary advantages is that it stabilizes the training process: by limiting the size of gradients, it prevents erratic updates to model parameters, resulting in smoother convergence. This stability is especially beneficial for deep networks and recurrent neural networks, where exploding gradients are more common due to the long chains of backpropagation through many layers or time steps.

Gradient clipping also comes with certain limitations. While it effectively addresses exploding gradients, it does not tackle the underlying causes of the problem, such as poor weight initialization or an unsuitable model architecture. This means that gradient clipping can act as a band-aid solution, masking deeper issues in the training setup. Additionally, its effectiveness depends on selecting an appropriate clipping threshold, which is not always straightforward and often requires careful experimentation.
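In a training loop, clipping is applied after the backward pass computes the gradients and before the optimizer updates the parameters. The course's own training code isn't shown in this excerpt; as a minimal sketch assuming a PyTorch setup, where the model, data, and thresholds are hypothetical placeholders:

import torch
import torch.nn as nn

# Hypothetical model, optimizer, loss, and batch, purely for illustration
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()  # backpropagation computes the gradients

# Clip by norm before the optimizer step; max_norm=1.0 is only a starting
# point, and the right threshold usually requires experimentation.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Or clip each gradient component individually instead:
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

optimizer.step()  # the parameter update uses the clipped gradients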