From the course: Deep Learning with Python: Optimizing Deep Learning Models

Learning rate scheduling - Python Tutorial

- [Instructor] Learning rate scheduling is a technique used in deep learning to adjust the learning rate during the training process to improve convergence and model performance. The learning rate is one of the most important settings because it determines how the model adjusts its internal parameters, weights and biases, in response to errors. If the learning rate is too high, the model might skip over the best solution and never converge. If it's too low, training becomes slow and might get stuck in less optimal solutions. Learning rate scheduling solves this by automatically adjusting the learning rate over time to make training faster and more effective. At the beginning of training, models often benefit from a higher learning rate because it allows for large, quick adjustments that help the model move toward a good general area of the solution. As training progresses, the learning rate is gradually lowered so the model can fine-tune its parameters with smaller adjustments, reducing the chance of overshooting the best solution.

There are several strategies for learning rate scheduling, each with its own unique characteristics. Step decay reduces the learning rate by a fixed factor at regular intervals, providing a straightforward way to accelerate training early on while fine-tuning later. This is like turning down the volume on a speaker in steps. It works well for training models where progress naturally slows down after a few stages. However, the abrupt changes in step decay can sometimes destabilize the training process.

With exponential decay, instead of making abrupt changes, the learning rate is reduced gradually over time following a curve. Think of it as gently easing up on the gas pedal in a car rather than slamming on the brakes. It avoids sudden shifts in the learning rate, but does require careful tuning of the decay rate.

Another popular method, cosine annealing, uses a cosine function to decrease the learning rate smoothly over time, often resetting periodically to encourage exploration of the loss landscape. Cosine annealing can be compared to dimming the lights in a room using a dimmer switch. At first, the brightness decreases gradually and smoothly, creating a calming transition. Periodic resets are like momentarily brightening the lights again to inspect the room before dimming them once more. It's particularly useful for tasks where progress can stall, and occasionally restarting at a slightly higher learning rate can help the model find better solutions. While effective, cosine annealing requires prior knowledge of the total training duration to plan its schedule.

Instead of always reducing the learning rate, the cyclical learning rate approach cycles the learning rate up and down between a minimum and a maximum value. Cyclical learning rates can be compared to riding a bicycle up and down a series of hills and valleys. As you approach a steep incline, analogous to the difficult regions of the loss landscape, you pedal harder with increased effort to push through. Once you reach the top and start coasting downhill, analogous to the easier regions, you ease off and slow down to regain control. This cyclical adjustment helps balance exploration and refinement, ensuring steady progress without getting stuck or overshooting. Using cyclical learning rates helps the model escape areas where progress has stalled, especially in problems where the solution is not a straight path but involves lots of peaks and valleys. However, cyclical learning rates introduce additional hyperparameters, such as the cycle length and amplitude, which can complicate tuning.
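The strategies described so far map onto ready-made scheduler classes in most deep learning frameworks. The snippet below is a minimal sketch using PyTorch's torch.optim.lr_scheduler module; the framework choice, the tiny placeholder model, and all hyperparameter values are illustrative assumptions rather than settings from the course.

```python
import torch
from torch import nn, optim

# Placeholder model and optimizer, purely for illustration.
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Step decay: multiply the learning rate by gamma every step_size epochs.
step_decay = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# Exponential decay: multiply the learning rate by gamma after every epoch.
exponential_decay = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# Cosine annealing with periodic restarts: decay smoothly over T_0 epochs,
# then jump back to the initial learning rate and repeat.
cosine_restarts = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=20)

# Cyclical learning rate: oscillate between base_lr and max_lr, spending
# step_size_up batches on each upward leg of the cycle.
cyclical = optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=0.001, max_lr=0.1, step_size_up=200
)

# In a real training loop, only one of these schedulers would be attached to
# the optimizer, and scheduler.step() would be called after optimizer.step()
# (once per epoch for the first three, once per batch for CyclicLR).
```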
Adaptive learning rate scheduling is a performance-driven approach where the learning rate is adjusted based on how the model is doing. For example, if the model's performance on validation data doesn't improve after a set number of epochs, the learning rate is automatically reduced. Adaptive learning rate scheduling is like driving a car with an automatic transmission. As you drive, the car adjusts its gears based on the terrain and speed, shifting to lower gears for uphill climbs and higher gears for smooth, flat roads. Using an adaptive learning rate schedule avoids wasting time when the current learning rate is no longer effective. However, it can prolong training if the patience parameter is not well-tuned.
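One common implementation of this performance-driven idea is PyTorch's ReduceLROnPlateau, sketched below under the same illustrative assumptions (placeholder model, optimizer, validation loss, and hyperparameter values):

```python
import torch
from torch import nn, optim

# Placeholder model and optimizer, purely for illustration.
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Cut the learning rate by a factor of 10 whenever the monitored validation
# loss fails to improve for 5 consecutive epochs (the patience parameter).
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

for epoch in range(30):
    # ... forward pass, loss.backward(), and optimizer.step() would go here ...
    val_loss = 1.0  # placeholder; pass the real validation loss in practice
    scheduler.step(val_loss)  # the scheduler reacts to the metric, not the epoch count
    print(epoch, optimizer.param_groups[0]["lr"])
```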
