The K-nearest neighbor (K-NN) classifier is one of the simplest classification models and one of the easiest to understand.

K-NN is a non-parametric method that classifies a point based on its distance to the training samples.

K-NN is called a lazy algorithm because it does not build a model from the training data; that is, it does not learn anything in the training phase.

In the training phase, it simply stores the training data in memory. All of the actual work happens at prediction time.

Besides classification, K-nearest neighbor is also sometimes used for regression. In that case, the prediction is the mean or median of the target values of the k nearest neighbors.

However, in this tutorial, we’ll focus solely on the classification setting.

**K-Nearest Neighbor:**

Let’s take a look at K-nearest neighbor from a graphical perspective.

Let’s suppose that we have a dataset with two classes, circles and triangles. Visually, this looks something like the following.

Now let’s say we have a mystery point whose class we need to predict.

To find out which class it belongs to, we compute the distance from the mystery point to each training sample and select the K nearest neighbors.

K is the number of nearest training samples that are considered when predicting the class of an unlabeled test record.

The class label of the new point is determined by a majority vote of its k nearest neighbors. The new point will be assigned to the class with the highest number of votes.

For example, if we choose k = 3, the three closest neighbors of the new observation are two circles and one triangle.

Therefore, by majority vote, the mystery point is classified as a circle.

The K-NN algorithm can be summarized as follows:

- Calculate the distances between the new input and all the training data.
- Find the nearest neighbors based on these pairwise distances.
- Classify the point based on a majority vote.
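The three steps above can be sketched in plain Python. This is only a minimal illustration, not how scikit-learn implements it: the function name `knn_predict` and the toy circle/triangle coordinates are made up for this example, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every training point.
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    # Step 2: keep the k training points with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three circles and three triangles in 2-D.
X_toy = [(2.0, 2.5), (2.5, 2.0), (1.0, 1.0), (4.0, 4.0), (5.0, 5.0), (5.5, 4.5)]
y_toy = ["circle", "circle", "circle", "triangle", "triangle", "triangle"]

mystery = (3.0, 3.0)
# The 3 nearest neighbors are two circles and one triangle.
print(knn_predict(X_toy, y_toy, mystery, k=3))  # circle
```

Note that every prediction scans the whole training set, which is why K-NN is cheap to "train" but can be slow at prediction time on large datasets.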

Let’s create a K-NN classifier using sklearn.

First, we’ll import the necessary packages from scikit-learn’s library.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

Now we’ll create a dataset using sklearn’s make_classification and split it into training and test sets.

    X, y = make_classification(n_samples=300, n_features=5, n_classes=2, random_state=192)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=43)

We need to define the number of nearest neighbors to build a KNN model. For this example, we’ll set the value of k as 3.

    # Build the classifier with k = 3; fit() stores the training data
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)

    # Predict the test set and report the accuracy
    y_pred = knn.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, y_pred))

    OUTPUT:
    Accuracy: 0.9393939393939394

**TUNING THE PARAMETER K:**

Selecting the proper value of k is crucial as it can affect the performance of the model.

If K is too small (for example, K=1), the model will have low bias but high variance (overfitting). For large values of K, say K=n, where n is the number of training points, the model will have high bias but low variance (underfitting).

So we need to find a balance between bias and variance.
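The two extremes can be checked directly. The sketch below rebuilds the same dataset and split used earlier so that it runs on its own, then fits one model with k = 1 and one with k equal to the number of training points. The variable names are chosen for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same dataset and split as in the example above.
X, y = make_classification(n_samples=300, n_features=5, n_classes=2, random_state=192)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=43)

# k = 1: each training point is its own nearest neighbor,
# so the model reproduces the training labels exactly.
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_acc_k1 = knn_1.score(X_train, y_train)
test_acc_k1 = knn_1.score(X_test, y_test)

# k = n: every vote includes all training points,
# so every prediction is the overall majority class.
knn_n = KNeighborsClassifier(n_neighbors=len(X_train)).fit(X_train, y_train)
distinct_predictions = set(knn_n.predict(X_test))

print("k=1 train accuracy:", train_acc_k1)
print("k=1 test accuracy:", test_acc_k1)
print("k=n distinct predicted classes:", distinct_predictions)
```

With k = 1 the training accuracy is perfect as long as no two identical points carry different labels, which says nothing about how well the model generalizes. With k = n the model ignores the input entirely and predicts a single class.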

However, there is no standard way to choose the value of k, so we have to experiment with different values. For binary classification, it is recommended to choose an odd K to avoid tied votes.

We’ll use cross-validation to select the best value of k.

In case you don’t know what cross-validation is, I have written an article explaining the different types of cross-validation. You can read it here.

    from sklearn.model_selection import cross_val_score
    from numpy import arange

    # Candidate odd values of k: 1, 3, ..., 49
    ks = list(arange(1, 50, 2))
    scores = []
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k)
        # Mean accuracy over 10-fold cross-validation on the training set
        score = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
        scores.append(score.mean())

    # Misclassification error = 1 - accuracy (kept under the name mse)
    mse = [1 - x for x in scores]

Now we can plot the misclassification error (the mse list) against K using matplotlib.

    import matplotlib.pyplot as plt

    plt.plot(ks, mse)
    plt.xlabel('K')
    plt.ylabel('Misclassification error')
    plt.show()

The best value of k corresponds to the lowest misclassification error.

Let’s print the value of k with the lowest misclassification error.

    print(ks[mse.index(min(mse))])

    OUTPUT:
    5
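The manual loop above can also be written with scikit-learn's GridSearchCV, which runs the same kind of cross-validated search and then refits the best model on the full training set. This is an alternative sketch, not the approach used in this article; it rebuilds the dataset so that it runs on its own, and it makes no claim about which k it selects.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same dataset and split as in the example above.
X, y = make_classification(n_samples=300, n_features=5, n_classes=2, random_state=192)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=43)

# Search the same odd values of k with 10-fold cross-validation.
param_grid = {"n_neighbors": list(range(1, 50, 2))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

# best_params_ holds the k with the highest mean cross-validated accuracy.
best_k = search.best_params_["n_neighbors"]
# score() uses the model refit on all training data with that k.
test_accuracy = search.score(X_test, y_test)

print("Best k:", best_k)
print("Test accuracy with best k:", test_accuracy)
```

Evaluating on the held-out test set only after k has been chosen keeps the test accuracy an honest estimate, since the test data played no part in the tuning.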

**WEIGHTED K-NN:**

One adjustment to the K-NN algorithm is to assign a weight to each of the nearest neighbors. The closer a neighbor is to the new point, the more weight that neighbor gets.

Consider the following example.

For 5-NN, the five closest neighbors are three triangles and two circles.

But since the two circles are closer to the point, they carry more weight than the triangles, so the new point will be classified as a circle.
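Here is a small sketch of that situation in plain Python. It assumes inverse-distance weights (1 / distance), which is one common choice; the function name `knn_vote` and the coordinates are invented for this illustration. In scikit-learn, a similar effect is available through `KNeighborsClassifier(weights='distance')`.

```python
import math
from collections import defaultdict


def knn_vote(X_train, y_train, x_new, k=5, weighted=True, eps=1e-9):
    """Return the winning label among the k nearest neighbors of x_new."""
    neighbors = sorted(
        ((math.dist(x, x_new), label) for x, label in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )[:k]
    scores = defaultdict(float)
    for distance, label in neighbors:
        # Weighted: each neighbor votes with 1 / distance (eps avoids division by zero).
        # Unweighted: each neighbor counts as exactly one vote.
        scores[label] += 1.0 / (distance + eps) if weighted else 1.0
    return max(scores, key=scores.get)


# Hypothetical toy data: the 5 nearest neighbors of the origin are
# two close circles and three more distant triangles.
X_toy = [(0.5, 0.0), (0.0, 0.5), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (5.0, 5.0), (6.0, 6.0)]
y_toy = ["circle", "circle", "triangle", "triangle", "triangle", "circle", "triangle"]
new_point = (0.0, 0.0)

print(knn_vote(X_toy, y_toy, new_point, k=5, weighted=False))  # triangle
print(knn_vote(X_toy, y_toy, new_point, k=5, weighted=True))   # circle
```

The plain majority vote picks triangle (3 votes to 2), while the weighted vote picks circle because the two circles are four times closer than the nearest triangles.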

**SUMMARY:**

In this article, we discussed one of the simplest classifiers, K-nearest neighbor.

Since no training is involved, it is also called a lazy learning algorithm.

We also implemented the K-Nearest Neighbor algorithm using scikit-learn and discussed how to tune the parameter K.

Finally, we discussed weighted K-NN, an extension of the K-NN algorithm in which the neighbors that are closer to the new observation get more weight in deciding the class of that observation.