From the course: Machine Learning with Python: Decision Trees
How is a regression tree built?
- [Instructor] Like classification trees, regression trees are built using a process known as recursive partitioning. For regression trees, the objective of recursive partitioning is to create successive child partitions that have less variability than their parent. To illustrate how recursive partitioning helps us build a regression tree, let's imagine that we work for a placement agency and that we just received the results of an income survey conducted by our agency. Each survey response includes the age of the worker, their level of education or highest degree earned, and their annual salary. Note that age and education level are the independent variables, or predictors, while salary is the dependent variable. Each of the survey responses can be represented on a scatter plot in terms of the dependent variable, annual salary, and one of the independent variables, age.

Recall that for regression trees, the idea behind recursive partitioning is to repeatedly split the data into smaller subsets in a way that minimizes the variability of the values within each subset. So the first thing a regression tree algorithm does is figure out how best to split this data into two partitions, or subsets, that reduce variability the most. One of the measures that regression tree algorithms rely on to find the best split is the sum of squared residuals, or SSR. A residual is the difference between an observed data point and a reference data point, such as the mean. The formula for computing the SSR of a partition with n values is shown here, where y-hat is the mean of the values in the partition and y-i is each value in the partition. The SSR of a partition quantifies the overall difference between the values in the partition and their average. A partition with a high SSR implies that the values in the partition are dissimilar, or very different from the mean; this is a partition that explains the data poorly. A partition with a low SSR implies that the values in the partition are similar, or close to the mean; this is a partition that explains the data well.

So how does a regression tree algorithm use SSR to determine the best split? Well, I'm glad you asked. Let's assume that the first split the algorithm evaluates is where age is equal to 27.5. This is the halfway point between the data points for age 25 and those for age 30. The values in the left partition are 16.8, 43.9, and 50.4. The average of these values is 37. Recall that a residual is the difference between an observed data point and a reference data point; the reference data point in this example is the average value, so the residuals are the differences between each value and the mean. To get the SSR, we square the residuals and add them up, which comes out to 635.2. Using the same approach for the right partition, we get an SSR of 13,106.9. The combined sum of squared residuals for both partitions, if the data is split where age is equal to 27.5, is the sum of the left SSR and the right SSR, which is 13,742.1.

To determine the split that reduces variability the most, the regression tree algorithm evaluates the SSR of every possible split and chooses the one with the lowest SSR, which is the split where age is equal to 40. This initial split creates the logic for the root node of our regression tree, which is shown here. It asks the question, is a worker 40 years old or younger?
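To make this concrete, here is a minimal Python sketch of the SSR calculation and the split search just described. The helper names (ssr, best_split) and the sample data are illustrative assumptions: only the left-partition values 16.8, 43.9, and 50.4 come from the example, and the remaining ages and salaries are made up, so the printed threshold and score will not match the video's numbers.

```python
import numpy as np

def ssr(values):
    """Sum of squared residuals: SSR = sum_i (y_i - y_hat)^2,
    where y_hat is the mean of the values in the partition."""
    return float(np.sum((values - values.mean()) ** 2))

def best_split(ages, salaries):
    """Try every candidate threshold (the midpoint between consecutive
    distinct ages) and return the one with the lowest combined SSR."""
    best_threshold, best_score = None, float("inf")
    distinct = np.unique(ages)
    for lo, hi in zip(distinct[:-1], distinct[1:]):
        threshold = (lo + hi) / 2              # e.g. 27.5, halfway between 25 and 30
        left = salaries[ages <= threshold]
        right = salaries[ages > threshold]
        score = ssr(left) + ssr(right)         # combined SSR for this split
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical sample of survey responses (age, annual salary in thousands).
# Only 16.8, 43.9, and 50.4 are taken from the example; the rest are made up.
ages = np.array([25, 25, 25, 30, 35, 40, 45, 50, 55, 60])
salaries = np.array([16.8, 43.9, 50.4, 52.0, 55.5, 58.2, 71.0, 76.4, 80.1, 83.5])

threshold, score = best_split(ages, salaries)
print(f"best split: age <= {threshold}, combined SSR = {score:.1f}")
```

Note that the left partition in the loop always uses "age less than or equal to the threshold," matching the root-node question "is a worker 40 years old or younger?"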
To create the branches and the next set of nodes, the regression tree algorithm makes some generalizations, or simplifying assumptions, based on the data in the two partitions. The first generalization it makes is based on the left partition: it estimates that if a worker is 40 years old or younger, then their annual salary will be 44,503, which is the average of the left partition. The second generalization is based on the data in the right partition: it estimates that if a worker is older than 40, then their annual salary will be 77,990, which is the average of the right partition. Depending on the data or some user-defined criteria, the regression tree algorithm could stop the recursive partitioning process here, or it could keep partitioning the data into smaller subsets with less variability. If the algorithm continues with the recursive partitioning process, the structure of the regression tree will continue to evolve as more partitions are created within the data.
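Continuing the earlier sketch, here is one way that recursion might look in code. The build_tree function reuses the best_split helper and the hypothetical ages and salaries arrays defined above; max_depth and min_samples stand in for the user-defined stopping criteria mentioned in the transcript, and their values here are arbitrary.

```python
def build_tree(ages, salaries, depth=0, max_depth=2, min_samples=2):
    """Recursive partitioning: keep splitting while allowed; otherwise
    create a leaf that predicts the mean salary of its partition."""
    threshold, _ = best_split(ages, salaries)
    # Stop when a user-defined criterion is met or no further split is
    # possible. The leaf's prediction is the partition average, just like
    # the 44,503 and 77,990 generalizations made after the first split.
    if depth == max_depth or len(ages) < 2 * min_samples or threshold is None:
        return {"predict": round(float(salaries.mean()), 1)}
    left = ages <= threshold
    return {
        "question": f"age <= {threshold}?",
        "yes": build_tree(ages[left], salaries[left], depth + 1, max_depth, min_samples),
        "no": build_tree(ages[~left], salaries[~left], depth + 1, max_depth, min_samples),
    }

# With max_depth=1 the tree stops after the root split: one question and
# two leaf predictions, one average per partition.
print(build_tree(ages, salaries, max_depth=1))
```

Raising max_depth lets the recursion continue, and the nested dictionaries grow in the same way the structure of the regression tree evolves as more partitions are created within the data.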