From the course: Artificial Intelligence Foundations: Neural Networks

Regularization techniques to improve overfitting models

- [Instructor] The purpose of a neural network is to capture the dominant trends in the data. Overfitting is bad because it means that the machine learning algorithm did not capture the dominant trend in the data and therefore won't be able to recognize that trend in new data it has never seen. The model did not really learn anything; it only memorized the training data without understanding it. As a result, the model cannot make accurate predictions, so your validation error is large while your training error is small, as shown in the image on the left. Regularization is a hyperparameter technique to improve overfitting models. It refers to a set of different techniques that lower the complexity of a neural network model during training and thus may prevent overfitting. The image on the right shows a list of regularization techniques that help mitigate overfitting. Let's take a look at three of them: early stopping, dropout, and regularization. In early stopping, you stop the training process early, before the model learns the fine details of the training data. Dropout works by randomly dropping out, or setting to zero, neurons during training. Regularization is a technique used in machine learning to prevent models from overfitting the training data by adding a penalty to the model's loss function, which encourages the model to be simpler.

Early stopping is a regularization technique that helps us avoid overfitting by stopping the training process of a neural network before it reaches the maximum number of iterations. Training is stopped early, before the model has had a chance to learn the specific details of the training data and start to overfit. Early stopping does this by monitoring, storing, and updating the best parameters during training; when parameter updates no longer yield an improvement after a set number of iterations, training is stopped and the last best parameters are used. In other words, if the performance of the model on the validation dataset starts to degrade, e.g., loss begins to increase or accuracy begins to decrease, the training process is stopped. To implement early stopping, you need to split your data into three sets: training, validation, and test. The training set is used to update the network parameters, the validation set is used to evaluate the network's performance and decide when to stop training, and the test set is used to measure the final accuracy of the network. The model is trained on the training set and its performance is evaluated on the validation set; if performance on the validation set starts to decrease, training is stopped.

Another approach to early stopping is to use a callback function. A callback function is a piece of code that is executed at regular intervals during the training process and returns information from the learning algorithm. As soon as your chosen metric stops improving for a fixed number of iterations, training stops, as shown in the image here, where Keras's EarlyStopping callback is used to monitor specific metrics like validation loss. Essentially, the callback function monitors the model's performance on the validation set and stops training if that performance starts to decrease.
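To make this concrete, here is a minimal sketch in Python of early stopping with a three-way data split, using the Keras EarlyStopping callback mentioned above. It uses placeholder data, and the layer sizes, split proportions, and patience value are illustrative assumptions, not values from the course.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder data; in practice X and y come from your own dataset.
X = np.random.rand(1000, 8).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

# Split into training (60%), validation (20%), and test (20%) sets.
X_train, X_val, X_test = X[:600], X[600:800], X[800:]
y_train, y_val, y_test = y[:600], y[600:800], y[800:]

model = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop once validation loss has not improved for 5 consecutive epochs,
# and roll back to the best weights seen during training.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=200,
          callbacks=[early_stop],
          verbose=0)

# The held-out test set measures the final performance of the network.
test_loss = model.evaluate(X_test, y_test, verbose=0)

Here the EarlyStopping callback plays exactly the monitoring role described above: it watches the validation loss after every epoch and halts training once it stops improving.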
With dropout, essentially at every iteration a random set of nodes is selected and dropped, along with all of their incoming and outgoing connections, which means those nodes are used in neither forward propagation nor backpropagation. Because not all nodes are active at the same time, the learning algorithm is forced to spread out the weights rather than rely on certain nodes.

In machine learning, regularization penalizes the coefficients; in deep learning, it penalizes the weight matrices of the nodes. A penalty term is added to the loss function during training, which discourages the model from becoming too complex or having large parameter values. L1 regularization, also known as lasso regression, prevents models from overfitting the training data by shrinking some weight values all the way to zero, which makes some features obsolete. This encourages the model to have fewer features with larger weights, which can help to prevent overfitting. The L1 formula on the right shows a regularization penalty term added to the model's loss function based on the absolute values of the model's parameters. L2 regularization is the most common type of all regularization techniques and is commonly known as weight decay or ridge regression. L2 combats overfitting by forcing weights to be small, but not exactly zero. The L2 formula on the right shows a penalty term added to the model's loss function that takes the square of the weights. For example, when you are predicting media sales, this means that less significant features, such as the newspaper budget, would still have some influence over the final prediction, though a very small one. A short code sketch of dropout and both weight penalties follows at the end of this section.

The main goal of training a neural network is to acquire a model that is able to generalize optimally on new, unseen data. Unfortunately, most neural network architectures often suffer from overfitting or underfitting. Recall that overfitting is the effect when the model fits the training data too well, which means the model is not able to make sense of unseen data it receives. Underfitting is a situation in which a model does not learn the underlying patterns of the data well enough. So in addition to tuning hyperparameters, best practices for optimizing neural networks include many techniques that involve the feature dataset, from feature selection and feature engineering to normalizing, splitting, and shuffling the dataset. Recall that in the hands-on data preprocessing portion of the exercise to build a simple neural network, you normalized the features before using them in the model, since the features had different ranges. You also split the data into a training and test set, and it was noted that a best practice is to split the data into three: a training, validation, and test set.
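The dropout and weight-penalty ideas above can be combined in a single Keras model. This is only a sketch: the layer sizes, the 20% dropout rate, and the penalty strengths (0.01 and 0.001) are illustrative assumptions, not recommended settings.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(8,)),  # illustrative input width
    # L2 (ridge / weight decay): adds lambda * sum(w^2) to the loss,
    # pushing weights toward small but nonzero values.
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),
    # Dropout: randomly zeroes 20% of this layer's outputs at each
    # training step, so no single node can be relied on too heavily.
    layers.Dropout(0.2),
    # L1 (lasso): adds lambda * sum(|w|) to the loss, which can drive
    # some weights exactly to zero and effectively drop features.
    layers.Dense(32, activation="relu",
                 kernel_regularizer=regularizers.l1(0.001)),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")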
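Likewise, the preprocessing best practices recalled above, normalizing the features and splitting the data three ways, might look like the following sketch. It uses scikit-learn purely as an illustration, and the feature ranges and split proportions are made up.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder features whose columns have very different ranges.
X = np.random.rand(1000, 4) * [1.0, 100.0, 10000.0, 0.01]
y = np.random.rand(1000)

# Shuffle and split into training (60%), validation (20%), and test (20%) sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, shuffle=True, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit the scaler on the training set only, then apply it to every split,
# so information from the validation and test data never leaks into training.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)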
