From the course: Machine Learning Foundations: Statistics

The correlation

- [Narrator] Your company's HR department wants to gain insight into the people working in software development, so they ask you to take a look at a dataset containing information on 50,000 people in the industry. You expect a strong relationship between the number of years of experience and salary, since seniors are usually paid more than juniors. In contrast, there is probably a weak relationship, or none at all, between pizza consumption and salary.

The statistical relationship between two variables is referred to as their correlation. A correlation can be positive, negative, or neutral. A positive correlation means both variables move in the same direction. Years of experience and salary is a case of positive correlation. A negative correlation means that when one variable increases, the other variable decreases. For example, when a factory increases safety training, there are fewer on-the-job injuries. A neutral correlation, or zero correlation, means that the variables are unrelated. So when the value of one variable increases or decreases, the value of the other variable shows no consistent tendency to increase or decrease.

ML often uses correlation during data analysis and data modeling. For example, two or more independent variables in a regression model can be highly correlated with one another, and we call this multicollinearity. Here are some examples of multicollinearity. It can happen when new variables are created that depend on other variables. We can have a dataset that contains people's height and weight and then create a BMI, or body mass index, variable. It would obviously be redundant information for a model. Or we can have effectively identical variables in the dataset, for example, people's height in centimeters and their height in feet. Multicollinearity can degrade the performance of some algorithms, so it's important to recognize the issue and find an appropriate solution.
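
The three kinds of correlation and the duplicated-height case of multicollinearity can be illustrated with a small sketch. It assumes the Pearson correlation coefficient as the measure, which the narration does not name, and every number in it is synthetic and made up for illustration. None of it comes from the HR dataset. The helper `pearson` and all variable names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the synthetic data is reproducible
n = 1000

# Synthetic data (assumption): salary rises with experience, plus noise.
experience = [random.uniform(0, 20) for _ in range(n)]
salary = [40_000 + 3_000 * e + random.gauss(0, 8_000) for e in experience]

# Synthetic data (assumption): pizza consumption is generated independently of salary.
pizzas = [random.uniform(0, 10) for _ in range(n)]

# Synthetic data (assumption): more safety training, fewer injuries, plus noise.
training_hours = [random.uniform(0, 40) for _ in range(n)]
injuries = [60 - t + random.gauss(0, 5) for t in training_hours]

r_positive = pearson(experience, salary)
r_zero = pearson(pizzas, salary)
r_negative = pearson(training_hours, injuries)

# Multicollinearity: the same height stored in two different units.
height_cm = [random.gauss(170, 10) for _ in range(n)]
height_ft = [h / 30.48 for h in height_cm]  # 1 foot = 30.48 cm
r_duplicate = pearson(height_cm, height_ft)

print(f"experience vs salary:     {r_positive:+.2f}")
print(f"pizza vs salary:          {r_zero:+.2f}")
print(f"training vs injuries:     {r_negative:+.2f}")
print(f"height (cm) vs height (ft): {r_duplicate:+.2f}")
```

In a sketch like this, a coefficient near +1 or -1 between two input variables is one simple signal that they carry overlapping information, and dropping one of them is a common first remedy.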
