
Distance/Similarity Measures in Machine Learning

INTRODUCTION:

For algorithms like k-nearest neighbors (KNN) and k-means, it is essential to measure the distance between data points.

In KNN we calculate the distance between points to find the nearest neighbors, and in k-means we use the distance between points to group them into clusters based on similarity.

It is vital to choose the right distance measure as it impacts the results of our algorithm.

In this post, we will see some standard distance measures used in machine learning.

EUCLIDEAN DISTANCE:

This is one of the most commonly used distance measures.

It is calculated as the square root of the sum of the squared differences between the corresponding elements of the two points.

d(a, b) = √( (a₁ − b₁)² + (a₂ − b₂)² + … + (aₙ − bₙ)² )

In simple words, Euclidean distance is the length of the line segment connecting the points.

Euclidean distance is also known as the L2 norm of a vector.
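
As a quick illustration, here is a minimal NumPy sketch (the example vectors a and b are made up):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Equivalently, the L2 norm of the difference vector
print(euclidean, np.linalg.norm(a - b))  # 5.196... 5.196...
```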

MANHATTAN DISTANCE:

Also called the city block distance or the L1 norm of a vector.

Manhattan distance is calculated as the sum of the absolute differences between the elements of the two points.

d(a, b) = |a₁ − b₁| + |a₂ − b₂| + … + |aₙ − bₙ|
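
A minimal NumPy sketch of the same idea, reusing the made-up example vectors:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Manhattan (L1) distance: sum of absolute differences
manhattan = np.sum(np.abs(a - b))
print(manhattan)  # 9
```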

CHEBYSHEV DISTANCE:

It is calculated as the maximum of the absolute differences between the elements of the two vectors.

d(a, b) = max( |a₁ − b₁|, |a₂ − b₂|, …, |aₙ − bₙ| )

It is also called the maximum value distance.
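
A minimal NumPy sketch, with made-up example vectors:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 7, 6])

# Chebyshev distance: the largest absolute difference across dimensions
chebyshev = np.max(np.abs(a - b))
print(chebyshev)  # 5
```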

MINKOWSKI DISTANCE:

The Minkowski distance is just a generalized form of the above distances.

Minkowski distance is also called the p-norm of a vector.

d(a, b) = ( |a₁ − b₁|ᵖ + |a₂ − b₂|ᵖ + … + |aₙ − bₙ|ᵖ )^(1/p)

MINKOWSKI FOR DIFFERENT VALUES OF P:

For p = 1, the distance measure is the Manhattan distance.

For p = 2, it is the Euclidean distance.

For p = ∞, it is the Chebyshev distance.
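
A small sketch of the general formula, using a hypothetical minkowski() helper to show how the special cases fall out (vectors are made-up examples):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski (p-norm) distance between two numeric vectors."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(minkowski(a, b, 1))     # 9.0      -> Manhattan
print(minkowski(a, b, 2))     # 5.196... -> Euclidean
print(np.max(np.abs(a - b)))  # 3        -> Chebyshev (the p -> infinity limit)
```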

HAMMING DISTANCE:

We use Hamming distance when we need to deal with categorical attributes.

Hamming distance measures whether two attributes are different or not: when they are equal, the distance is 0; otherwise, it is 1.

We can use Hamming distance only if the strings are of equal length.

For example, let’s take the two strings “Hello World” and “Hallo Warld”.

The Hamming distance between these two strings is 2, as the strings differ in two places.
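
A minimal sketch in plain Python, using a hypothetical hamming() helper:

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("Hello World", "Hallo Warld"))  # 2
```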

COSINE SIMILARITY:

It measures the cosine of the angle between the two vectors.


Cosine similarity ranges from −1 to 1 (from 0 to 1 for vectors with only non-negative components), where 1 means the two vectors point in exactly the same direction.

As the angle between two vectors increases, they become less similar.

cos(θ) = (a · b) / (‖a‖ ‖b‖)

Cosine similarity cares only about the angle between the two vectors and not the distance between them.

Assume there’s another vector c in the direction of b.


What do you think the cosine similarity would be between b and c?

The cosine similarity between b and c is 1 since the angle between b and c is 0 and cos(0) = 1.

Even though the distance between b and c is large compared to the distance between a and b, cosine similarity cares only about the direction of the vectors and not about their magnitude.
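
A minimal NumPy sketch of this behaviour, with made-up vectors where c points in the same direction as b but lies farther away:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1, 1])
b = np.array([4, 0])
c = np.array([8, 0])  # same direction as b, twice as far from the origin

print(cosine_similarity(a, b))  # ~0.707 (45-degree angle)
print(cosine_similarity(b, c))  # 1.0 -> only direction matters, not magnitude
```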

JACCARD SIMILARITY AND DISTANCE:

In Jaccard similarity, instead of vectors, we work with sets. It is used to find the similarity between two sets.

Jaccard similarity is defined as the size of the intersection of the two sets divided by the size of their union.

Jaccard similarity between two sets A and B is

J(A, B) = |A ∩ B| / |A ∪ B|

JACCARD DISTANCE:

We use Jaccard distance to find how dissimilar two sets are.

1 – jaccard_similarity will give you the Jaccard distance.
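
A minimal sketch in plain Python, with made-up example sets:

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = {"machine", "learning", "is", "fun"}
B = {"deep", "learning", "is", "fun"}

sim = jaccard_similarity(A, B)
print(sim)      # 0.6 -> 3 shared elements out of 5 distinct elements
print(1 - sim)  # 0.4 -> Jaccard distance
```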

SUMMARY:

In this post, I have discussed various distance measures in machine learning. Now the question is: which distance measure should you choose?

You should choose the right distance measure based on the properties of your data.

Euclidean distance can be used if the input variables are similar in type or if we want to find the distance between two points.

In the case of high dimensional data, Manhattan distance is preferred over Euclidean.

The Hamming distance is used for categorical variables.

Cosine similarity can be used where the magnitude of the vector doesn’t matter. Both Jaccard and cosine similarity are often used in text mining.
