A machine learning pipeline bundles up the sequence of steps into a single unit.
For example, in text classification the documents go through a fixed sequence of steps such as tokenizing, cleaning, feature extraction, and training. A pipeline can bundle all of these steps into a single unit.
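As a rough illustration (this sketch is our own and separate from the example built below; the step names 'tfidf' and 'clf' are arbitrary choices), a text-classification pipeline could chain a vectorizer and a classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical two-step text pipeline: TfidfVectorizer handles tokenizing,
# cleaning, and feature extraction; the classifier handles training.
text_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])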
Now let’s see how to construct a pipeline. Here our pipeline will have two steps: scaling the data using StandardScaler and classification using LogisticRegression.
The first step is to import the classes from scikit-learn that we will need for this task.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
Let’s load the iris dataset and split it into train and test sets.
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)
Now let’s create a simple pipeline using the Pipeline() class.
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
The pipeline takes a list of steps, where each step is described as a tuple containing a string that names the step and an instance of the transformer or estimator class.
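These step names matter: they are how the components are addressed later (for example, in grid search), and each step can be looked up by its name through the pipeline’s named_steps attribute:

# Each step can be retrieved by the name we gave it.
print(pipe.named_steps['scaler'])  # the StandardScaler instance
print(pipe.named_steps['clf'])     # the LogisticRegression instance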
Now we can fit the pipeline to the training data.
pipe.fit(X_train, y_train)
When pipe.fit is called, the data is first transformed using StandardScaler, and the transformed samples are then passed on to the final estimator, which here is a LogisticRegression model.
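Conceptually, this is equivalent to running the two steps by hand. A minimal sketch of what happens internally:

# Equivalent to pipe.fit: fit the scaler on the training data, transform,
# then fit the estimator on the transformed samples.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
clf = LogisticRegression()
clf.fit(X_train_scaled, y_train)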
If the last estimator is a classifier, we can also use the predict or score method on the pipeline.
score = pipe.score(X_test, y_test)
print(score)
OUTPUT:
0.9333333333333333
The other way to create a pipeline is by using make_pipeline.
The difference in usage between Pipeline and make_pipeline is that make_pipeline generates the step names automatically: each step is named after the lower-cased class name of its estimator or transformer.
from sklearn.pipeline import make_pipeline

pipe2 = make_pipeline(StandardScaler(), LogisticRegression())
print(pipe2.fit(X_train, y_train).score(X_test, y_test))
OUTPUT:
0.9333333333333333
We can also loop through multiple models in a pipeline.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

classifiers = [
    KNeighborsClassifier(),
    SVC(),
    LogisticRegression()
]

for clf in classifiers:
    pipe3 = Pipeline([('scaler', StandardScaler()), ('clf', clf)])
    pipe3.fit(X_train, y_train)
    score = pipe3.score(X_test, y_test)
    print(clf)
    print(score)
OUTPUT:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
0.9555555555555556
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
0.9777777777777777
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='warn',
                   n_jobs=None, penalty='l2', random_state=None, solver='warn',
                   tol=0.0001, verbose=0, warm_start=False)
0.9333333333333333
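As a small variation on this loop (our own sketch, not part of the original example), the scores can be collected in a dictionary keyed by class name, which makes the comparison easier to read:

# Collect each model's test score keyed by its class name.
results = {}
for clf in classifiers:
    pipe3 = Pipeline([('scaler', StandardScaler()), ('clf', clf)])
    results[type(clf).__name__] = pipe3.fit(X_train, y_train).score(X_test, y_test)
print(results)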
Using Pipeline with GridSearchCV:
In order to find the best configuration of the hyperparameters, we can also use the pipeline with GridSearchCV.
We first need to define a parameter grid for the model. To address a parameter, we specify the step name, followed by a double underscore (__), followed by the parameter name.
For instance, to address the n_neighbors hyperparameter of KNN, we use “clf__n_neighbors”.
The grid search then evaluates every combination of parameters in the grid across the whole pipeline to find the best configuration.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

pipe4 = Pipeline([('scaler', StandardScaler()),
                  ('pca', PCA()),
                  ('clf', KNeighborsClassifier())])

param_grid = {
    'pca__n_components': [2, 3, 4],
    'clf__n_neighbors': np.arange(1, 30, 2)
}

gcv = GridSearchCV(pipe4, param_grid, n_jobs=-1)
gcv.fit(X_train, y_train)

print('Best Parameter:', gcv.best_params_)
print('Best Score:', gcv.best_score_)
OUTPUT:
Best Parameter: {'clf__n_neighbors': 9, 'pca__n_components': 3}
Best Score: 0.9523809523809523
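Since GridSearchCV refits the best pipeline on the whole training set by default, we can also score it on the held-out test data; the exact number will depend on the split:

# gcv.score uses the best estimator found during the search.
print('Test Score:', gcv.score(X_test, y_test))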
SUMMARY:
In this article, we discussed pipelines in machine learning. A machine learning pipeline bundles up the sequence of steps into a single unit.
We created a simple pipeline using scikit-learn. We can create a pipeline either by using Pipeline or by using make_pipeline.
Then we saw how we can loop through multiple models in a pipeline.
Finally, we discussed how to use GridSearchCV with pipeline to find the ideal set of parameters.