The K-Nearest Neighbor (KNN) classifier is one of the easiest classification methods to understand and one of the most basic classification models available.

KNN is a non-parametric method that classifies samples based on their distance to the training data.

KNN is called a lazy algorithm: it does not build a model from the training data, i.e., it does not really learn anything in the training phase.

Instead, in the training phase it simply stores the training data in memory; all the work happens in the testing phase.

In this tutorial, we’ll implement KNN from scratch using NumPy.

**KNN:**

Let’s take a look at K-nearest neighbors from a graphical perspective.

Let’s suppose that we have a dataset with two classes, circle and triangle. Visually, this looks like the following.

Now let’s say we have a mystery point whose class we need to predict.

To find out which class it belongs to, we compute the Euclidean distance from the mystery point to each training sample and select the K nearest neighbors.

The k indicates the number of closest training samples to consider when predicting an unlabeled test point.

The class label of the new point is determined by a majority vote of its k nearest neighbors. The new point will be assigned to the class with the highest number of votes.

For example, if we choose the value of k to be 3, then the three closest neighbors of the new observation are two circles and one triangle.

Therefore by majority vote, the mystery point will be classified as a circle.
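The vote itself can be sketched with Python’s `collections.Counter`; the labels below are made up to mirror the circle/triangle example:

```python
from collections import Counter

# hypothetical labels of the 3 nearest neighbors from the example above
neighbors = ["circle", "circle", "triangle"]

# most_common(1) returns the (label, count) pair with the most votes
label, votes = Counter(neighbors).most_common(1)[0]
print(label, votes)  # circle 2
```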

The K-NN algorithm can be summarized as follows:

- Calculate the distances between the new input and all the training data.
- Find the nearest neighbors based on these pairwise distances.
- Classify the point based on a majority vote.

Now let’s create a simple KNN from scratch using Python.

First, let’s import the modules we’ll need and create the distance function, which calculates the Euclidean distance between two points.

```python
import numpy as np
import operator

def euc_dist(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))
```
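As a quick sanity check of the distance function, here is a made-up 3-4-5 right triangle:

```python
import numpy as np

def euc_dist(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

# the distance between (0, 0) and (3, 4) should be 5 (a 3-4-5 triangle)
print(euc_dist(np.array([0, 0]), np.array([3, 4])))  # 5.0
```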

Now let’s define the class.

```python
class KNearestNeighbors():
    def __init__(self, K=3):
        self.K = K

    def fit(self, x_train, y_train):
        self.X_train = x_train
        self.Y_train = y_train
```

In the __init__ method we initialize K (the number of nearest neighbors) to 3 by default.

Since KNN does not build any model with training data, we’ll just store the values.

Now let’s define our predict method.

```python
    def predict(self, X_test):
        # list to store all our predictions
        predictions = []

        # loop over all observations in the test set
        for i in range(len(X_test)):
            # distance between the test point and every point in the training set
            dist = np.array([euc_dist(X_test[i], x_t) for x_t in self.X_train])

            # sort the distances and keep the indices of the K nearest neighbors
            dist_sorted = dist.argsort()[:self.K]

            # count the class label of each neighbor
            neigh_count = {}
            for idx in dist_sorted:
                if self.Y_train[idx] in neigh_count:
                    neigh_count[self.Y_train[idx]] += 1
                else:
                    neigh_count[self.Y_train[idx]] = 1

            # sort the (class, votes) pairs by vote count, largest first
            sorted_neigh_count = sorted(neigh_count.items(),
                                        key=operator.itemgetter(1),
                                        reverse=True)

            # append the winning class label to the list
            predictions.append(sorted_neigh_count[0][0])
        return predictions
```

First, we create an empty list, `predictions`, to store all our predictions.

Then we loop over all the points in the test set.

For each test point, we calculate the distance to every point in the training set and store the results in the array `dist`.

Next, we sort the distances using argsort() and keep the first K entries. argsort() returns the indices that would sort the array, so `dist_sorted` holds the indices of the K nearest training points.
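In isolation, argsort() behaves like this (the distances are made up for illustration):

```python
import numpy as np

# hypothetical distances from one test point to four training points
dist = np.array([2.3, 0.5, 1.7, 0.9])

# indices of the K = 3 nearest training points, closest first
nearest_idx = dist.argsort()[:3]
print(nearest_idx)  # [1 3 2]
```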

`neigh_count` is an empty dictionary that maps each neighboring class to its vote count.

We then loop over the indices in `dist_sorted` and tally the class of each neighbor in the dictionary.

If the neighbor’s class is already present in the dictionary, we increment its count; otherwise we set it to 1.

Next, we sort the dictionary items in descending order of their values. The values in the dictionary are the number of votes for each class.

The operator.itemgetter(1) in the key tells sorted() to order the (class, votes) pairs by their second element, the vote count.
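As a standalone example, with made-up vote counts for two class labels:

```python
import operator

# hypothetical vote counts: class 1 received 2 votes, class 0 received 1
neigh_count = {0: 1, 1: 2}

# order the (class, votes) pairs by vote count, largest first
sorted_neigh_count = sorted(neigh_count.items(),
                            key=operator.itemgetter(1), reverse=True)
print(sorted_neigh_count)        # [(1, 2), (0, 1)]
print(sorted_neigh_count[0][0])  # 1 -> the winning class label
```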

Finally, we append the winning class label to our list of predictions.

Let’s test out the KNN model. As usual, first, let’s import the necessary modules.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

np.random.seed(2020)
```

Now we can create a dataset with 150 points and 2 classes and split it into train and test sets.

```python
data, target = make_classification(n_samples=150, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
```

Now we can call the fit and predict methods of our KNN class.

```python
clf = KNearestNeighbors(K=5)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
```

```
Accuracy: 0.9333333333333333
```

The complete code for this tutorial can be found in this GitHub repo.