From the course: Artificial Intelligence Foundations: Machine Learning

Structuring a machine learning pipeline

- Have you ever pressed Shift+Enter or Run Cell in your Jupyter Notebook and thought to yourself, there has to be an easier and repeatable way to do this? Well, you would be right. And the answer you're looking for is called a pipeline. The data collection and preparation phase of machine learning consists of many steps: imputing missing data, handling outliers, understanding correlations, feature engineering, encoding, algorithm experimentation, and more. When training a model to predict the cost of homes, we did each step individually, and sometimes even repeated steps when experimenting with a different learning algorithm on the same dataset. Can you imagine supporting a production machine learning system this way? You'd want to use pipelines instead. Pipelines allow you to assemble all the steps of your machine learning workflow together. This assembly helps to streamline the workflow by grouping and automating our data preparation, training, evaluation, tuning, and deployment steps. There are many benefits to incorporating pipelines into your process. First, pipelines allow you to encapsulate your code for easy reuse. During the process of training a model to predict the cost of homes, we experimented with multiple learning algorithms, Linear Regression, RandomForestRegressor, and XGBoost, to identify which one would perform better on our dataset. We called the fit function multiple times and executed each code block independently. With pipelines, we can call fit and predict only once on the data to run an entire sequence of estimators. Pipelines add a level of convenience, reproducibility, and a way to enforce your data preparation steps across all iterations of model training. Pipelines can also speed up your training process, giving you a faster time to market by running certain steps in parallel. They also help with iterative hyperparameter tuning. With so many benefits, it's easy to see why pipelines are in high demand in the industry.
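To make the idea concrete, here is a minimal sketch of the pattern the narration describes: assembling imputation, scaling, and an estimator into one scikit-learn Pipeline so a single fit and a single score run the whole sequence. The synthetic dataset and step names are illustrative, not the course's housing data.

```python
# Illustrative sketch: one Pipeline encapsulates preprocessing + model,
# so fit/predict run every step in order. (Synthetic data, not the
# course's home-price dataset.)
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardize features
    ("model", LinearRegression()),                 # final estimator
])

pipe.fit(X_train, y_train)       # one fit call runs the whole sequence
r2 = pipe.score(X_test, y_test)  # R-squared on held-out data
print(f"R^2 = {r2:.3f}")
```

Because the preprocessing steps live inside the pipeline, they are re-applied identically on every fit, which is exactly the reproducibility benefit described above.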
Scikit-learn's Pipeline class provides the structure to organize the sequence of steps we need to execute to train a model. First, let's look at the normal machine learning code. We trained multiple models. First, we used linear regression to train a model. We ran the predictions, we evaluated the model, and we looked at the R-squared metric. Then we did something similar for RandomForestRegressor: we trained the model, we ran the predictions, and we did the evaluation. And lastly, we used XGBoost: we trained the model, ran the predictions, and evaluated the model. That's the code we used. Now let's look at how the code is updated to implement a pipeline. This is the code that we use to train the models in parallel using pipelines. Look at that. Look at how much more efficient the code is. And when you see the code in action, you'll see how much faster the training process becomes because we can execute three training jobs in parallel. Pipelines help you build quick and efficient machine learning models. Pipelines are quickly gaining popularity in the industry, and we can see why. Automating this process saves time and reduces redundant preprocessing work while organizing your code to be more efficient. Now, let's see pipelines in action.
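Since the on-screen code isn't reproduced in the transcript, here is a hedged sketch of the before/after pattern being described: instead of three separate fit/predict/evaluate blocks, each estimator is wrapped in the same Pipeline and the three training jobs run in parallel with joblib. GradientBoostingRegressor stands in for XGBoost here to keep the example scikit-learn-only; the dataset, step names, and n_jobs setting are all assumptions.

```python
# Hedged sketch: three estimators, one shared Pipeline template, trained
# in parallel with joblib. (GradientBoostingRegressor stands in for
# XGBoost; synthetic data replaces the course's dataset.)
from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimators = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(random_state=0),
    "boosted": GradientBoostingRegressor(random_state=0),  # XGBoost stand-in
}

def fit_and_score(name, estimator):
    # Same preprocessing enforced for every candidate model
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    pipe.fit(X_train, y_train)
    return name, pipe.score(X_test, y_test)  # R-squared

# Run the three training jobs in parallel
results = dict(Parallel(n_jobs=3)(
    delayed(fit_and_score)(name, est) for name, est in estimators.items()
))
for name, r2 in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: R^2 = {r2:.3f}")
```

Note that a single Pipeline still runs its own steps sequentially; the parallelism here comes from training the three candidate pipelines at the same time, which is one common way to realize the speedup mentioned in the narration.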