From the course: Machine Learning with Python: Decision Trees
How to prune a decision tree - Python Tutorial
- [Instructor] The basic idea behind recursive partitioning is to repeatedly split data into smaller subsets in such a way that maximizes the homogeneity or similarity of items within each subset. Generally, this process continues until all of the instances within a partition are of the same class or value, all the features in the dataset have been exhausted, or some user-defined condition has been satisfied. Usually, if we allow a tree to grow uninhibited until it meets the first two criteria, it may become too large or it may overfit against the training data. Overfitting occurs when a decision tree fits our data too perfectly. A tree that overfits does a great job of explaining the data in the training set but performs poorly when given new or test data. To avoid overfitting, we have to carefully manage the size of a decision tree during or after the recursive partitioning process. This process is known as pruning. Pruning helps a decision tree generalize well, which means that it will likely perform well when presented with previously unseen data.

We can limit the size of a decision tree during the recursive partitioning process by setting criteria that need to be met at each split point. These criteria can be in the form of the maximum number of features to be considered for each split, the maximum number of decision nodes to allow for the tree, or the minimum number of data points to allow in each partition. This approach to pruning is known as pre-pruning. Pre-pruning is appealing in that it prevents unnecessary branches and nodes from being created, which saves compute cycles. However, pre-pruning could stop tree growth too early. This means that the tree may not get a chance to learn certain patterns in the data. An alternative pruning approach is post-pruning. As the name suggests, the idea here is to allow the decision tree to grow as large as it can and then reduce its size afterward. With regards to compute time, post-pruning is not as efficient as pre-pruning; however, it does offer the benefit of being more effective at discovering important patterns within the data. There are several approaches to post-pruning. One of the most popular is known as cost complexity pruning, or weakest link pruning.

To help explain how cost complexity pruning works, let's imagine that we work for a placement agency and that we just received the results of an income survey conducted by our agency. Each survey response includes the age of the worker, their level of education or highest degree earned, and their annual salary. Limiting our focus to age and annual salary, each of the survey responses can be represented on a scatterplot this way. Recall that the idea behind recursive partitioning is to repeatedly split data into smaller subsets in such a way that maximizes the homogeneity or similarity of items within each subset. If we allow the recursive partitioning process to continue uninhibited on this data, it would partition on every possible age value, and we would end up with partitions that look like this. This would result in a decision tree that looks like this. This decision tree fits our data too perfectly and has likely overfit. A better decision tree for our data is one that is based on fewer than the maximum number of partitions, like one of those shown here. We can think of these sub-trees as pruned versions of the previous one. The question then is, how do we choose the best sub-tree to use? This is where cost complexity pruning comes in.
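To make the pre-pruning idea concrete, here is a minimal sketch using scikit-learn's DecisionTreeRegressor. The `age` and `salary` arrays are synthetic stand-ins for the survey data described above, and the specific limits are illustrative values, not settings from the video.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the survey data: worker age vs. annual salary.
rng = np.random.default_rng(0)
age = rng.integers(20, 65, size=200).reshape(-1, 1)
salary = 30_000 + 900 * age.ravel() + rng.normal(0, 8_000, size=200)

# Pre-pruning: constrain the tree while it is being grown by setting
# criteria that every split must satisfy.
pre_pruned = DecisionTreeRegressor(
    max_features=None,    # consider all features at each split (the maximum)
    max_leaf_nodes=8,     # cap on the number of leaf nodes in the tree
    min_samples_leaf=5,   # minimum number of data points allowed in each partition
    random_state=42,
).fit(age, salary)

# For comparison, an unconstrained tree keeps partitioning until every leaf
# is pure, which on this data means splitting on nearly every age value.
full_tree = DecisionTreeRegressor(random_state=42).fit(age, salary)

print("pre-pruned leaves:", pre_pruned.get_n_leaves())
print("unpruned leaves:  ", full_tree.get_n_leaves())
```

The same pre-pruning parameters are available on DecisionTreeClassifier, so the approach carries over to classification trees unchanged.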
The first step in the process is to calculate the sum of squared residuals, or SSR, for each sub-tree. For more information on how to compute the SSR for a tree, watch the previous video in this course sequence, titled "How is a regression tree built?" Notice that the larger the tree gets, the lower its SSR. This means that the larger the tree is, the better job it does at explaining the relationship between the independent variables and the dependent variable. However, we also know that the larger the tree is, the more likely it is to overfit against our data. To account for this, a better metric for choosing a sub-tree is the tree score, or cost complexity measure. The cost complexity measure for a tree is the sum of squared residuals for the tree plus the tree complexity penalty. The tree complexity penalty compensates for the number of leaf nodes in the tree. It is the product of the complexity parameter, a user-defined value, and the number of leaf nodes in the tree. As a side note, because the decision tree used in this illustration is a regression tree, we use the sum of squared residuals in calculating the tree score. If we were working with a classification tree, we would use entropy or Gini impurity instead of SSR to calculate the tree score.

If we set alpha, the complexity parameter, to 1,000, the tree score for each sub-tree will be as shown here. This means that we will choose the first sub-tree because it has the lowest tree score. As you may have noticed, the choice of alpha has a significant impact on which tree ends up having the lowest tree score. For example, if we reduce the value of alpha to 100, we get different tree scores for each tree. This means that we would choose this sub-tree instead because it now has the lowest tree score. Higher values for alpha favor simpler trees, while smaller values for alpha favor more complex trees. The question then is, how do we decide on a value for alpha? The simple answer is that alpha is a hyperparameter, and we use a process known as hyperparameter tuning to find the best value. Conceptually, finding the best alpha value involves training several sub-trees based on different values for alpha, then choosing the sub-tree that performs the best on the test data. The alpha for this tree is the best value for alpha.
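As a rough sketch of how this step might look in code (not the course's exercise files), scikit-learn exposes the complexity parameter as `ccp_alpha`, and `cost_complexity_pruning_path` enumerates the candidate alpha values for weakest link pruning. The `tree_score` helper and the synthetic age/salary data below are assumptions added for illustration, and GridSearchCV stands in for the hyperparameter tuning step described above, using cross-validation rather than a single test set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the survey data (same idea as the earlier sketch).
rng = np.random.default_rng(0)
age = rng.integers(20, 65, size=200).reshape(-1, 1)
salary = 30_000 + 900 * age.ravel() + rng.normal(0, 8_000, size=200)
X_train, X_test, y_train, y_test = train_test_split(age, salary, random_state=42)

# Tree score (cost complexity measure) for a fitted regression tree:
#   tree score = SSR + alpha * (number of leaf nodes)
def tree_score(tree, X, y, alpha):
    ssr = ((y - tree.predict(X)) ** 2).sum()
    return ssr + alpha * tree.get_n_leaves()

# scikit-learn enumerates the candidate ("weakest link") alphas for us.
path = DecisionTreeRegressor(random_state=42).cost_complexity_pruning_path(X_train, y_train)
alphas = np.clip(path.ccp_alphas, 0, None)  # guard against tiny negative values

# Hyperparameter tuning: grow and prune one sub-tree per candidate alpha,
# then keep the one that performs best on held-out data.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"ccp_alpha": alphas},
    cv=5,
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_
print("best alpha:", search.best_params_["ccp_alpha"])
print("tree score at alpha=1,000:", tree_score(best_tree, X_test, y_test, alpha=1_000))
```

The numbers printed here depend entirely on the synthetic data, so treat them as placeholders; the pattern of enumerating candidate alphas, fitting one pruned sub-tree per alpha, and keeping the tree that scores best on held-out data is the part that mirrors the tuning step described above.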