From the course: Machine Learning with Python: Decision Trees

How is a classification tree built?

- [Instructor] Classification trees are built using a process known as recursive partitioning. The basic idea behind this process is to repeatedly split data into smaller subsets in such a way that maximizes the homogeneity or similarity of items within each subset. To illustrate how recursive partitioning helps us build a classification tree, let's imagine that we work for a small commercial bank and that we have historical data for 30 personal loans issued by our bank. Each loan record includes the annual income of the borrower, the amount that was borrowed, and the outcome of the loan, which is represented here by the default column. Note that the income and loan amount columns are what we call the independent variables or predictors, while the default column is the dependent variable or class.

Each of the 30 loans previously issued by our bank can be represented in terms of the dependent and independent variables using a scatter plot. From the plot, we can see that of the 30 loans in the dataset, 16 ended in default, the red triangles, and 14 were paid back in full, the green circles. Recall that the idea behind recursive partitioning is to repeatedly split data into smaller subsets in such a way that maximizes the similarity of items within each subset. So the first thing we need to do here is figure out how best to split this data into two so that we have partitions or subsets that maximize the similarity or purity of outcomes.

Using two axis-parallel lines, we scan both axes to determine where to split the data. By visual inspection, we find that splitting on a loan amount of $40,000 gives us the best split. Based on this split, we get 14 loans of $40,000 or less to the left and 16 loans of more than $40,000 to the right. Splitting the data this way gives us the two partitions with the most homogeneity of loans in favor of one of the two outcomes. Any other axis-parallel line we could have drawn would result in partitions with less purity or homogeneity. Notice that I use the terms homogeneity, similarity, and purity to represent the same idea. This initial split creates the logic for the root node of our classification tree, which is shown here. It simply asks the question, did a customer borrow $40,000 or less?

To create the branches and the next set of nodes, we make some generalizations or simplifying assumptions. Of the loans that were for more than $40,000, 10 resulted in default while six were paid back in full. In other words, 63% of the loans, or 10 out of 16, in this partition resulted in default. Because default is the dominant outcome in this partition, we will assume or generalize that any future loans for more than $40,000 will also result in default. As you can see, some of the loans in the partition, the red circles, do not conform to our assumption. We refer to these as the misclassified examples in the training data. Our goal should be to have very few of these. The assumption we made for this partition determines the structure of the first branch and leaf node in our classification tree.

Now let's take a look at the other partition. Of the loans that were for $40,000 or less, eight out of 14, or 57%, were paid back in full while six resulted in default. Because not default is the dominant outcome in this partition, we generalize that any future loans of $40,000 or less will also be paid back in full. As expected, we also have some misclassified examples in this partition. These are the green triangles. The generalization we made for this partition determines the structure of the second branch and leaf node in our classification tree.
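To make the split search concrete, here is a minimal Python sketch of the idea: apply an axis-parallel threshold to one feature and measure how pure each resulting partition is. The handful of loan records below are invented for illustration; they are not the course's 30-loan dataset, and the helper functions are hypothetical rather than part of any library.

```python
# A minimal sketch (not the course's code) of evaluating one axis-parallel
# split by how "pure" the two resulting partitions are.

def partition_purity(labels):
    """Fraction of items in a partition that belong to its majority class."""
    if not labels:
        return 1.0
    majority = max(set(labels), key=labels.count)
    return labels.count(majority) / len(labels)

def evaluate_split(rows, feature, threshold):
    """Split rows on feature <= threshold and report the purity of each side."""
    left = [r["default"] for r in rows if r[feature] <= threshold]
    right = [r["default"] for r in rows if r[feature] > threshold]
    return partition_purity(left), partition_purity(right)

# Hypothetical loan records: income and loan_amount in dollars,
# default is the class label ("yes" = defaulted, "no" = paid in full).
loans = [
    {"income": 35_000, "loan_amount": 25_000, "default": "no"},
    {"income": 18_000, "loan_amount": 30_000, "default": "yes"},
    {"income": 52_000, "loan_amount": 60_000, "default": "yes"},
    {"income": 47_000, "loan_amount": 38_000, "default": "no"},
    {"income": 29_000, "loan_amount": 55_000, "default": "yes"},
    {"income": 61_000, "loan_amount": 42_000, "default": "no"},
]

# Try one candidate split: loan_amount <= 40,000.
left_purity, right_purity = evaluate_split(loans, "loan_amount", 40_000)
print(f"left purity: {left_purity:.2f}, right purity: {right_purity:.2f}")
```

A real tree-building algorithm repeats this evaluation over every candidate feature and threshold and keeps the split whose partitions are purest, which is the role the $40,000 loan-amount split plays in the example above.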
We can stop the recursive partitioning process here, or we can keep trying to create purer partitions within the data. For instance, we know that within the left partition, we misclassified six of the 14 examples. To reduce the number of misclassified examples, we need to further partition the data.

Using two axis-parallel lines, we scan to determine where to split this partition. By visual inspection, we find that splitting on an annual income of $20,000 gives us the best split. Of the eight customers who borrowed $40,000 or less and earn more than $20,000 a year, seven paid their loan back in full and one defaulted. Because not default is the dominant outcome in this partition, we generalize that any future customers who earn more than $20,000 a year and borrow $40,000 or less will also pay back their loan in full. In similar fashion, we generalize that any future customers who earn $20,000 or less a year and borrow $40,000 or less will default on their loan. Note that each of these partitions now has only one misclassified example. They are much purer.

These two new partitions and the generalizations we made for them result in a structural change to our classification tree. The tree will now include a new decision node which branches into two new leaf nodes. We can continue the recursive partitioning process in an attempt to create smaller and more homogeneous partitions, or we can stop here. Generally, classification tree algorithms continue the recursive partitioning process until all of the instances within a partition are of the same class or value, all the features in the dataset have been exhausted, or some user-defined condition has been satisfied.
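For a sense of how these stopping conditions appear in practice, here is a minimal sketch using scikit-learn's DecisionTreeClassifier, assuming the library is installed. The six loan records are again invented purely for illustration, and the hyperparameter values are arbitrary, not recommendations from the course.

```python
# A minimal sketch (not the course's code) showing how common stopping
# conditions are expressed as hyperparameters in scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: annual income, loan amount (both in dollars). Invented values.
X = np.array([
    [35_000, 25_000],
    [18_000, 30_000],
    [52_000, 60_000],
    [47_000, 38_000],
    [29_000, 55_000],
    [61_000, 42_000],
])
# Class labels: 1 = defaulted, 0 = paid back in full.
y = np.array([0, 1, 1, 0, 1, 0])

tree = DecisionTreeClassifier(
    max_depth=2,                # stop splitting after two levels
    min_samples_split=2,        # a node needs at least 2 examples to be split
    min_impurity_decrease=0.0,  # require no minimum purity gain (the default)
    random_state=0,
)
tree.fit(X, y)

# Print the learned axis-parallel splits as text, one decision node per line.
print(export_text(tree, feature_names=["income", "loan_amount"]))
```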
