Cross-validation is an important evaluation technique used to assess the generalization performance of a machine learning model.
It helps us estimate how well a model trained on one data set will perform on unseen data. There are two main categories of cross-validation in machine learning.
- Exhaustive
- Non-Exhaustive
Before we discuss these two types, let’s first understand why a validation set is essential.
WHY IS A VALIDATION SET IMPORTANT?
It is vital to make sure that we don’t use our test data for anything except assessing the final model.
For example, assume that you are developing a K-Nearest Neighbor classifier and you split the data set into two partitions, a train set and a test set.
We train the classifier on the training set and use the test data to find the optimal value of K; naturally, the best K is the one with the lowest test error rate.
Suppose we get 90% accuracy on the test set for K = 3.
Can we now say that our model has an accuracy of 90%?
The answer is no. Since we used the test data in the creation of our model, to choose the value of the parameter K, the error rate estimate of our model is biased. This is known as data snooping.
Because of this, we can’t be certain that our model will perform well on unseen data points.
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
– Section 5.3, Data Snooping, in Learning From Data: A Short Course
Using the test set for anything other than evaluating the fully trained final model inflates the performance the model appears to achieve on that test set.
To overcome this issue, we’ll divide our data into three sets.
Train Set: Used to train the model.
Validation Set: Used to tune hyperparameters like K in K-NN or the number of hidden layers in a neural network.
Test Set: Used to assess the performance of the fully trained model.
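To make the three-way split concrete, here is a minimal sketch of tuning K on a validation set and touching the test set only once at the end. The data set (Iris), the split ratios, and the candidate values of K are illustrative choices, not part of the original example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# First split off the test set, then carve a validation set out of
# the remaining data (roughly 60/20/20 overall).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Tune K using the validation set only; the test set stays untouched.
best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9]:
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# The test set is used exactly once, to assess the final model.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print('Chosen K:', best_k, 'Test accuracy:', final_model.score(X_test, y_test))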
EXHAUSTIVE:
According to Wikipedia, exhaustive cross-validation methods learn and test on all possible ways to divide the original sample into a training and a validation set.
Two types of exhaustive cross-validation are
1) Leave-P-Out Cross-Validation:
In this strategy, p observations are used for validation, and the remaining observations are used for training.
For a data set with n observations, each round trains on n-p observations and validates on the other p.
Since this method is exhaustive, it trains and tests on all n-choose-p possible combinations, so it quickly becomes computationally expensive for large n and p.
LPO for p=2
import numpy as np
from sklearn.model_selection import LeavePOut

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Generate every possible way of holding out p = 2 of the 6 observations.
lpo = LeavePOut(p=2)
for train, validate in lpo.split(data):
    print("Train set:{}".format(data[train]), "Test set:{}".format(data[validate]))

OUTPUT:
Train set:[0.3 0.4 0.5 0.6] Test set:[0.1 0.2]
Train set:[0.2 0.4 0.5 0.6] Test set:[0.1 0.3]
Train set:[0.2 0.3 0.5 0.6] Test set:[0.1 0.4]
Train set:[0.2 0.3 0.4 0.6] Test set:[0.1 0.5]
Train set:[0.2 0.3 0.4 0.5] Test set:[0.1 0.6]
Train set:[0.1 0.4 0.5 0.6] Test set:[0.2 0.3]
Train set:[0.1 0.3 0.5 0.6] Test set:[0.2 0.4]
Train set:[0.1 0.3 0.4 0.6] Test set:[0.2 0.5]
Train set:[0.1 0.3 0.4 0.5] Test set:[0.2 0.6]
Train set:[0.1 0.2 0.5 0.6] Test set:[0.3 0.4]
Train set:[0.1 0.2 0.4 0.6] Test set:[0.3 0.5]
Train set:[0.1 0.2 0.4 0.5] Test set:[0.3 0.6]
Train set:[0.1 0.2 0.3 0.6] Test set:[0.4 0.5]
Train set:[0.1 0.2 0.3 0.5] Test set:[0.4 0.6]
Train set:[0.1 0.2 0.3 0.4] Test set:[0.5 0.6]
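To get a feel for how fast the number of LPO iterations grows, we can evaluate n-choose-p directly. This short check uses math.comb from Python's standard library (available since Python 3.8); the pairs of n and p are arbitrary illustrations:

from math import comb

# Number of LPO train/validate iterations = C(n, p).
# C(6, 2) = 15 matches the 15 splits printed above.
for n, p in [(6, 2), (20, 2), (100, 3), (100, 5)]:
    print('n={:>3}, p={}: {:,} iterations'.format(n, p, comb(n, p)))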
2) Leave-One-Out Cross-Validation:
This is a variant of LPO: when p = 1 in Leave-P-Out cross-validation, it becomes Leave-One-Out cross-validation.
In LOO, a single observation is held out of the training set as the validation set.
We train the model on the remaining n-1 observations and then test whether it correctly predicts the held-out observation, repeating the process n times so that every observation is held out exactly once.
Leave-One-Out CV
import numpy as np
from sklearn.model_selection import LeaveOneOut

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Each split holds out exactly one observation for testing.
loo = LeaveOneOut()
for train, test in loo.split(data):
    print('Train:', data[train], 'Test:', data[test])

OUTPUT:
Train: [0.2 0.3 0.4 0.5 0.6] Test: [0.1]
Train: [0.1 0.3 0.4 0.5 0.6] Test: [0.2]
Train: [0.1 0.2 0.4 0.5 0.6] Test: [0.3]
Train: [0.1 0.2 0.3 0.5 0.6] Test: [0.4]
Train: [0.1 0.2 0.3 0.4 0.6] Test: [0.5]
Train: [0.1 0.2 0.3 0.4 0.5] Test: [0.6]
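The splits above only print indices. To turn LOO into an actual error estimate, we train a model on each split and average the n single-observation scores; here is a minimal sketch using scikit-learn's cross_val_score, with the Iris data and a K-NN classifier as illustrative stand-ins:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Each of the n iterations tests on exactly one held-out observation;
# the mean over all n scores is the LOO accuracy estimate.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print('LOO accuracy: {:.3f} over {} splits'.format(scores.mean(), len(scores)))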
NON-EXHAUSTIVE:
In non-exhaustive methods, we do not evaluate all possible ways of splitting the original data.
1) HOLDOUT:
This is the simplest method of all. In holdout validation, the data is randomly partitioned into a train set and a test set.
Most of the time this is a 70/30 or 80/20 split.
We train the model on the training set and then test it on the test set to see how well it performs on unseen data.
Holdout Validation – 70% train and 30% test
import numpy as np
from sklearn.model_selection import train_test_split

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

# Randomly hold out 30% of the data for testing; the seed makes the split reproducible.
train, test = train_test_split(data, test_size=0.3, random_state=43)
print('Train:', train, 'Test:', test)

OUTPUT:
Train: [0.8 0.3 0.6 0.2 0.1 0.5] Test: [0.4 0.9 0.7]
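For classification problems, train_test_split also accepts a stratify argument that preserves the class proportions in both partitions, which helps when the labels are imbalanced. A small sketch with made-up data and labels:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # a 60/40 class balance

# stratify=y keeps roughly the same 60/40 label ratio in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print('Train labels:', y_train, 'Test labels:', y_test)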
2) K-FOLD:
This is the most frequently used cross-validation method.
In k-fold cross-validation, we randomly split the data set into k equal-sized subsets, or folds.
Out of these k folds, we treat k-1 as the training set and the remaining fold as the test set.
The process is repeated for k iterations; in each iteration a different fold is held out for testing while the remaining k-1 folds are used for training.
K-Fold for k=3
import numpy as np
from sklearn.model_selection import KFold

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

# 3 consecutive folds; KFold does not shuffle by default.
kf = KFold(n_splits=3)
for train, test in kf.split(data):
    print('Train{}'.format(data[train]), 'Test{}'.format(data[test]))

OUTPUT:
Train[0.4 0.5 0.6 0.7 0.8 0.9] Test[0.1 0.2 0.3]
Train[0.1 0.2 0.3 0.7 0.8 0.9] Test[0.4 0.5 0.6]
Train[0.1 0.2 0.3 0.4 0.5 0.6] Test[0.7 0.8 0.9]
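Note that KFold does not shuffle by default, which is why the folds above are consecutive blocks. In practice we usually shuffle with a fixed seed and average the k fold scores into a single estimate; the data set and classifier in this sketch are illustrative choices:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Train on k-1 folds, test on the held-out fold, then average the k scores.
scores = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print('Mean accuracy over {} folds: {:.3f}'.format(kf.get_n_splits(), np.mean(scores)))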
3) STRATIFIED K-FOLD:
We use stratified k-fold to cope with class imbalance in the data set.
Stratified k-fold maintains the class proportions by splitting the data so that each fold contains approximately the same proportion of labels as the original data set.
This ensures that when the data set is imbalanced, no class is over-represented in any fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
y = np.array([1, 1, 1, 1, 1, 0])

# shuffle=True without a random_state, so the exact splits vary between runs.
skf = StratifiedKFold(n_splits=2, shuffle=True)
for train, validate in skf.split(data, y):
    print('Train:', data[train], 'Test:', data[validate])

OUTPUT:
Train: [0.1 0.4] Test: [0.2 0.3 0.5 0.6]
Train: [0.2 0.3 0.5 0.6] Test: [0.1 0.4]
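To see the stratification at work, we can count the labels that land in each fold. With the tiny 5-to-1 label vector above, balanced folds are impossible (scikit-learn even warns when the least populated class has fewer members than n_splits), so this sketch uses a slightly larger, made-up label vector with a 2:1 imbalance:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # 8 vs 4: a 2:1 imbalance

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold preserves the 2:1 class ratio: 4 zeros and 2 ones.
    print('Test fold label counts:', np.bincount(y[test_idx]))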
4) MONTE CARLO:
Monte Carlo cross-validation (also known as repeated random subsampling) splits the data randomly into train and test sets, repeats this process multiple times, and averages the results over all splits.
The disadvantage of this method is that some observations may never be selected for the test set, whereas others may be selected multiple times.
Monte Carlo CV
import numpy as np
from sklearn.model_selection import ShuffleSplit

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

# 5 independent random splits, each holding out 20% of the data.
shuffle = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for train, validate in shuffle.split(data):
    print('Train:', data[train], 'Test:', data[validate])

OUTPUT:
Train: [0.6 0.1 0.9 0.3 0.5 0.4 0.7] Test: [0.8 0.2]
Train: [0.6 0.4 0.5 0.8 0.9 0.7 0.3] Test: [0.1 0.2]
Train: [0.1 0.7 0.8 0.6 0.4 0.2 0.5] Test: [0.9 0.3]
Train: [0.7 0.3 0.9 0.1 0.4 0.5 0.6] Test: [0.2 0.8]
Train: [0.5 0.9 0.1 0.8 0.7 0.4 0.3] Test: [0.2 0.6]
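As with k-fold, the per-split scores are averaged into one estimate, but here the number of splits and the test fraction are chosen independently of each other. A minimal sketch, again with Iris and K-NN as illustrative stand-ins:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 10 independent random 80/20 splits; some points may repeat across
# test sets while others never appear, the trade-off noted above.
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=ss)
print('Mean accuracy over {} random splits: {:.3f}'.format(ss.get_n_splits(), scores.mean()))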
SUMMARY:
In this tutorial, we discussed various types of cross-validation used in machine learning.
Cross-validation is a method to estimate how well a model trained on a data set will generalize to unseen data.
We mainly discussed two types of cross-validation: Exhaustive and Non-Exhaustive.
Exhaustive: This method learns and tests on all possible ways to divide the original sample into a training and a validation set.
Non-Exhaustive: We do not evaluate all possible ways of splitting the original data.
In the Exhaustive method, we discussed Leave-P-Out and Leave-One-Out CV.
In Non-Exhaustive, we discussed methods like Holdout, K-Fold, Stratified K-Fold, and Monte Carlo CV.