From the course: Deep Learning with Python: Optimizing Deep Learning Models
Batch normalization
- [Instructor] In deep learning, as model parameters are updated during training, the distribution of input values in each layer can change as the model learns. This change, known as internal covariate shift, can slow down the learning process and make it more challenging. Batch normalization addresses this by normalizing the inputs to each layer so that they have a consistent scale and distribution during training.

Batch normalization operates in three main steps. First, it calculates the mean and variance of each feature in the mini-batch. This gives a snapshot of how the inputs are distributed for that batch. Next, it normalizes the inputs to have zero mean and a standard deviation of one. This ensures that the inputs to the layer are standardized, making the model easier to train. Given a mini-batch of inputs B, the normalization step is represented mathematically as shown here, where xi hat is the new standardized input, xi is the original input, mu B is the mean of the mini-batch, and sigma B is the standard deviation of the mini-batch.

The third step in batch normalization is to scale and shift the normalized values using two trainable parameters, gamma and beta. These parameters allow the model to adjust the normalized values if necessary, so it can still learn the best representation of the data. Mathematically, the scaling and shifting process is represented as shown here, where yi is the scaled and shifted output, gamma is the scaling parameter, xi hat is the normalized input, and beta is the shifting parameter.

The advantages of batch normalization are significant. It accelerates training by stabilizing the learning process, enabling the use of higher learning rates, and reducing sensitivity to weight initialization. It also improves generalization by acting as a form of regularization, reducing the risk of overfitting. Furthermore, batch normalization simplifies hyperparameter tuning and supports the training of deeper networks by mitigating issues such as vanishing or exploding gradients.

Despite its benefits, batch normalization has some limitations. It depends heavily on the mini-batch size, as small mini-batches may not yield accurate estimates of the mean and variance, leading to degraded performance. Additionally, the computational overhead from extra operations, such as calculating statistics and normalizing inputs, can slightly increase training time. Finally, batch normalization is less effective for tasks where batch sizes are small or for recurrent architectures with long sequences.
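The equations referenced above appear on the course slides rather than in the narration. In the standard batch normalization formulation they are:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \, \hat{x}_i + \beta

where epsilon is a small constant added for numerical stability; it is part of most implementations even though it is not mentioned in the narration.

To make the three steps concrete, here is a minimal NumPy sketch of the training-time forward pass. The function name batch_norm_forward and the epsilon default are illustrative choices, not code from the course, and the running statistics that real layers keep for inference are omitted.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Step 1: mean and variance of each feature across the mini-batch
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    # Step 2: normalize to zero mean and unit standard deviation
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)
    # Step 3: scale and shift with the trainable parameters gamma and beta
    return gamma * x_hat + beta

# Illustrative usage: a mini-batch of 32 samples with 4 features
x = 10.0 + 5.0 * np.random.randn(32, 4)
gamma = np.ones(4)   # scale, typically initialized to 1
beta = np.zeros(4)   # shift, typically initialized to 0
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0))  # close to 0 for each feature
print(y.std(axis=0))   # close to 1 for each feature

In Keras, the same operation is available as the tf.keras.layers.BatchNormalization layer, which also tracks the running statistics needed at inference time.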