The K-nearest neighbor(K-NN) classifier is one of the easiest classification methods to understand and is one of the most basic classification models available.

K-NN is a non-parametric method which classifies based on the distance to the training samples.

K-NN is called a lazy algorithm. Technically, it does not build any model with training data; i.e., it does not really learn anything in the training phase.

Actually, in the training phase, it just stores the training data in the memory and works in the testing phase.

Besides classification, K-nearest neighbor is also sometimes used for regression. However, in this tutorial, we’ll focus solely on the classification setting.

**K-Nearest Neighbor:**

Let’s take a look at K-nearest neighbors from a graphical perspective.

Let’s suppose that we have a dataset with two classes circle and triangle. Visually, this looks like the following.

Now let’s say we have a mystery point whose class we need to predict.

To find out which class it belongs to we need to compare the distance of the mystery point to the training samples and selecting the K nearest neighbors.

The k indicates the number of close training samples to be regarded when predicting an unlabeled test record.

The class label of the new point is determined by a majority vote of its k nearest neighbors. The new point will be assigned to the class with the highest number of votes.

For example, if we choose the value of k to be 3 then the three closest neighbors of the new observation are two circles and one triangle.

Therefore by majority vote, the mystery point will be classified as a circle.

The K-NN algorithm can be summarized as follows:

- Calculate the distances between the new input and all the training data.
- Find the nearest neighbors based on these pairwise distances.
- Classify the point based on a majority vote.

Let’s create a K-NN classifier using sklearn.

First, we’ll import the necessary packages from scikit-learn’s library.

1 2 3 4 5 6 7 |
import numpy as np from sklearn.neighbors import KNeighborsClassifier from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score np.random.seed(2019) |

Now we’ll create a dataset using sklearn’s make_classification and split the dataset into train and test set.

1 2 |
X, y = make_classification(n_samples=300, n_features = 5, n_classes=2) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33) |

We need to define the number of nearest neighbors to build a KNN model. Let’s set the value of k as 3.

1 2 3 4 5 6 |
knn = KNeighborsClassifier(n_neighbors=3) knn.fit(X_train, y_train) y_pred = knn.predict(X_test) print(f'Accuracy:', (accuracy_score(y_test, y_pred))) |

1 2 |
OUTPUT: Accuracy: 0.9393939393939394 |

**TUNING THE PARAMETER K:**

Selecting the proper value of k is crucial as it can affect the performance of the model.

If K is too small K=1 the model will have low bias but high variance(Overfitting) and for higher values of K, say K=n where n is the number of points; the model will have high bias but low variance(Underfitting).

So we need to find a balance between bias and variance.

However, there is no standard way to choose the value for k, so we have to experiment with different values of k. It is recommended to select the value of K as an odd number to avoid ties.

We’ll use cross-validation to select the best value of k.

In case if you don’t know what cross-validation is I have written an article explaining different types of cross-validation. You can read it here.

1 2 3 4 5 6 7 8 9 10 |
from sklearn.model_selection import cross_val_score from numpy import arange ks = list(arange(1, 50, 2)) scores = [] for k in ks: knn = KNeighborsClassifier(n_neighbors=k) score = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy') scores.append(score.mean()) mse = [1 - x for x in scores] |

Now we can plot the MSE against K using matplotlib

1 2 3 4 5 |
import matplotlib.pyplot as plt plt.plot(ks, mse) plt.xlabel('K') plt.ylabel('MSE') plt.show() |

The best value of k corresponds to the lowest misclassification error.

Let’s print the value of k with the lowest misclassification error.

1 |
print(ks[mse.index(min(mse))]) |

1 2 |
OUTPUT: 7 |

Now let’s train the model with n_neighbors = 7

1 2 3 4 5 6 |
knn = KNeighborsClassifier(n_neighbors=7) knn.fit(X_train, y_train) y_pred = knn.predict(X_test) print(f'Accuracy:', (accuracy_score(y_test, y_pred))) |

1 2 |
OUTPUT: Accuracy: 0.9494949494949495 |

**WEIGHTED K-NN:**

One of the adjustments made to the K-NN algorithm is by assigning a weight to the nearest neighbors. The closer a point is to the neighbor, the more weight that neighbor gets.

For example, let’s take the following example.

For 5-NN, the five closest neighbors are three triangles and two circles.

But since the two circles are closer to the point they have more weight compared to the triangles, so the new point will be classified as a circle.

**SUMMARY:**

In this article, we discussed one of the simplest classifiers K-nearest neighbor.

Technically as there is no training is involved, it is also called a lazy learning algorithm.

We also implemented the K-Nearest Neighbor algorithm using scikit-learn and discussed how to tune the parameter K.

Finally, we discussed weighted K-NN, which is an extension of the K-NN algorithm, where the neighbors which are closer to the new observation get more weight in deciding the class of that observation.