
Machine Learning Pipeline

A machine learning pipeline bundles a sequence of data-processing and modeling steps into a single unit.

For example, in text classification, documents pass through a fixed sequence of steps such as tokenization, cleaning, feature extraction, and training. A pipeline bundles all of these steps into a single object.

Now let’s see how to construct a pipeline. Our pipeline will have two steps: scaling the data with StandardScaler and classifying it with a k-nearest neighbors (KNN) model.

The first step is to import the scikit-learn classes and functions that we need for the task.

Let’s load the iris dataset and split it into training and test sets.
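The loading and splitting step could look roughly like the sketch below. The split ratio, the stratification, and the random_state value are assumptions chosen for illustration; the article does not specify them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris features (X) and class labels (y) as NumPy arrays.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing.
# test_size, stratify and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Iris has 150 samples with 4 features each, so this split leaves 120 samples for training and 30 for testing.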

Now let’s create a simple pipeline using the Pipeline() class.

A pipeline is defined by a list of steps. Each step is a tuple containing a string, which names the step, and an instance of a transformer or estimator class.
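A minimal sketch of such a pipeline follows. The step name "clf" matches the name used later in the parameter grid; the name "scaler" is an assumption made here for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, instance) tuple; steps run in list order.
# "scaler" is a hypothetical name; "clf" is the name the grid search uses later.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", KNeighborsClassifier()),
])

# The step names can be read back from the pipeline.
step_names = [name for name, _ in pipe.steps]
print(step_names)
```

All steps except the last must be transformers (they implement fit and transform); the last step can be any estimator.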

Now we can fit the pipeline to the training data.

When pipe.fit is called, it first fits StandardScaler on the training data and transforms it, and then passes the transformed samples to the final estimator, which is a KNN model.
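To make that order of operations concrete, the sketch below fits the pipeline and then repeats the same two steps by hand. The step names, split settings, and default KNN parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # illustrative settings
)

# One call fits the scaler, transforms the training data, then fits KNN.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])
pipe.fit(X_train, y_train)

# The same two steps written out manually.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)

# Both routes should produce identical predictions on the test set.
same = np.array_equal(
    pipe.predict(X_test), knn.predict(scaler.transform(X_test))
)
print(same)
```

Note that the scaler is fitted on the training data only, so no statistics from the test set leak into the model.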

If the last step is a classifier, we can also call the predict or score method on the pipeline. The new data passes through the fitted scaler before it reaches the classifier.
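As a rough illustration, predicting and scoring with a fitted pipeline might look like this. The dataset split and step names are the same illustrative assumptions as before.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # illustrative settings
)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])
pipe.fit(X_train, y_train)

# predict scales X_test with the fitted scaler, then calls KNN's predict.
y_pred = pipe.predict(X_test)

# score does the same and returns the classifier's mean accuracy.
accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```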

Another way to create a pipeline is with make_pipeline.

The difference between Pipeline and make_pipeline is that make_pipeline generates the step names automatically: each name is the lowercased class name of the corresponding transformer or estimator.
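A short sketch of the same two-step pipeline built with make_pipeline, showing the generated names:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# No names are passed; make_pipeline derives them from the class names.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'kneighborsclassifier']
```

With these generated names, a KNN hyperparameter would be addressed as kneighborsclassifier__n_neighbors rather than clf__n_neighbors.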

We can also loop through multiple models in a pipeline.
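One way to read this is to build the same pipeline several times, swapping a different classifier into the last step on each iteration. The sketch below does that; the choice of classifiers, the dictionary keys, and the split settings are assumptions for illustration and not necessarily what the article originally used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # illustrative settings
)

# Hypothetical set of candidate models to compare.
models = {
    "knn": KNeighborsClassifier(),
    "svc": SVC(),
    "tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # Same preprocessing for every model; only the final step changes.
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", model)])
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because every candidate shares the same scaling step, the comparison isolates the effect of the classifier itself.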

Using Pipeline with GridSearchCV:

To find the best hyperparameter configuration, we can pass the pipeline to GridSearchCV.

We first need to define a parameter grid for the model. Each key in the grid is the step name, followed by a double underscore (__), followed by the parameter name.

For instance, to address the n_neighbors hyperparameter of the KNN step, which is named clf here, we use “clf__n_neighbors”.
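The naming convention can be checked directly on a pipeline, as in this small sketch. The step name "scaler" and the candidate values in the grid are assumptions for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])

# get_params lists every addressable parameter as <step>__<parameter>.
has_param = "clf__n_neighbors" in pipe.get_params()
print(has_param)

# The same key works with set_params.
pipe.set_params(clf__n_neighbors=3)
print(pipe.named_steps["clf"].n_neighbors)

# A parameter grid uses these keys; the candidate values are illustrative.
param_grid = {"clf__n_neighbors": [3, 5, 7, 9]}
```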

The grid search then fits the whole pipeline for every parameter combination in the grid and uses cross-validation to select the best-scoring set of parameters.
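Putting the pieces together, a grid search over the pipeline might be sketched as follows. The candidate values for n_neighbors, the number of folds, and the split settings are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # illustrative settings
)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])

# Keys follow the <step>__<parameter> convention; values are illustrative.
param_grid = {"clf__n_neighbors": [3, 5, 7, 9]}

# 5-fold cross-validation; the scaler is refitted inside every fold.
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print(f"Best cross-validation accuracy: {grid.best_score_:.3f}")

# By default the best pipeline is refitted on all training data,
# so the grid object can score the held-out test set directly.
test_accuracy = grid.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Searching over the pipeline rather than the bare classifier keeps the scaling inside each cross-validation fold, so the validation folds never influence the scaler's statistics.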

SUMMARY:

In this article, we discussed pipelines in machine learning. A machine learning pipeline bundles up the sequence of steps into a single unit.

We created a simple pipeline using scikit-learn. We can create a pipeline either by using Pipeline or by using make_pipeline.

Then we saw how we can loop through multiple models in a pipeline.

Finally, we discussed how to use GridSearchCV with a pipeline to find the best set of hyperparameters.
