INTRODUCTION:
For algorithms like k-nearest neighbors (KNN) and k-means, it is essential to measure the distance between data points.
In KNN we calculate distances between points to find the nearest neighbors, and in k-means we calculate distances between points to group them into clusters based on similarity.
It is vital to choose the right distance measure as it impacts the results of our algorithm.
In this post, we will see some standard distance measures used in machine learning.
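For instance, here is a minimal sketch of how a distance function plugs into nearest-neighbor search (the points and query below are made-up illustrations, not from any dataset):

from math import sqrt

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical 2-D points and a query point, purely for illustration
points = [[1, 3], [4, 2], [0, 0]]
query = [1, 1]

# The nearest neighbor is the point with the smallest distance to the query
nearest = min(points, key=lambda p: euclidean(query, p))
print("Nearest neighbor:", nearest)   # Nearest neighbor: [0, 0]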
EUCLIDEAN DISTANCE:
This is one of the most commonly used distance measures.
It is calculated as the square root of the sum of squared differences between the corresponding elements of two points.
In simple words, Euclidean distance is the length of the line segment connecting the points.
Euclidean distance is also known as the L2 norm, computed on the difference of the two vectors.
from math import sqrt

def euclidean(x, y):
    # sum the squared differences of the corresponding elements
    distance = 0
    for a, b in zip(x, y):
        distance += (a - b) ** 2
    return sqrt(distance)

print("Euclidean Distance:", euclidean([1, 3, 4, 1], [3, 2, 1, 1]))
OUTPUT:
Euclidean Distance: 3.7416573867739413
MANHATTAN DISTANCE:
Also called the city block distance or the L1 norm.
Manhattan distance is calculated as the sum of the absolute differences between the corresponding elements of two points.
def manhattan(x, y):
    # sum the absolute differences of the corresponding elements
    distance = 0
    for a, b in zip(x, y):
        distance += abs(a - b)
    return distance

print("Manhattan Distance:", manhattan([1, 3, 4, 1], [3, 2, 1, 1]))
OUTPUT:
Manhattan Distance: 6
CHEBYSHEV DISTANCE:
It is calculated as the maximum of the absolute difference between the elements of the vectors.
It is also called the maximum value distance.
def chebyshev(x, y):
    # collect the absolute differences, then take the largest
    distance = []
    for a, b in zip(x, y):
        distance.append(abs(a - b))
    return max(distance)

print("Chebyshev Distance:", chebyshev([1, 3, 4, 1], [3, 2, 1, 1]))
OUTPUT:
Chebyshev Distance: 3
MINKOWSKI DISTANCE:
The Minkowski distance is a generalized form of the distances above.
It is also called the p-norm of a vector, defined as d(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/p). A minimal implementation for different values of p is sketched after the list below.
MINKOWSKI FOR DIFFERENT VALUES OF P:
For p = 1, the distance measure is the Manhattan distance.
For p = 2, the distance measure is the Euclidean distance.
For p = ∞, the distance measure is the Chebyshev distance.
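The post gives no code for the Minkowski distance, so here is a minimal sketch consistent with the examples above (the function name minkowski and the use of math.inf for the p = ∞ case are my own choices):

from math import inf

def minkowski(x, y, p):
    # p = inf reduces to the Chebyshev (maximum) distance
    if p == inf:
        return max(abs(a - b) for a, b in zip(x, y))
    # otherwise, the p-norm of the element-wise differences
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print("Minkowski (p=1):", minkowski([1, 3, 4, 1], [3, 2, 1, 1], 1))      # 6.0, matches Manhattan
print("Minkowski (p=2):", minkowski([1, 3, 4, 1], [3, 2, 1, 1], 2))      # 3.7416..., matches Euclidean
print("Minkowski (p=inf):", minkowski([1, 3, 4, 1], [3, 2, 1, 1], inf))  # 3, matches Chebyshev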
HAMMING DISTANCE:
We use Hamming distance when we need to deal with categorical attributes.
Hamming distance compares two vectors element by element: when a pair of elements is equal it contributes 0, and when they differ it contributes 1. The distance is the total number of positions at which the two vectors differ.
We can use hamming distance only if the strings are of equal length.
For example, let’s take two strings “Hello World” and “Hallo Warld”
The Hamming distance between these two strings is 2 as the string differs in two places.
def hamming(x, y):
    # count the positions at which the two sequences differ;
    # note that zip stops at the shorter sequence, so x and y should be the same length
    distance = 0
    for a, b in zip(x, y):
        if a != b:
            distance += 1
    return distance

print("Hamming Distance:", hamming("hello world", "hallo warld"))
OUTPUT:
Hamming Distance: 2
COSINE SIMILARITY:
It measures the cosine of the angle between two vectors: cos(θ) = (x · y) / (||x|| ||y||).
Cosine similarity ranges from −1 to 1 (from 0 to 1 when the vectors have no negative components), where 1 means the two vectors point in exactly the same direction.
As the angle between two vectors increases, they become less similar.
Cosine similarity cares only about the angle between the two vectors and not the distance between them.
Assume there’s another vector c in the direction of b.
What do you think the cosine similarity would be between b and c?
The cosine similarity between b and c is 1 since the angle between b and c is 0 and cos(0) = 1.
Even though the distance between b and c is large compared to the distance between a and b, the cosine similarity is unaffected: it cares only about the direction of the vectors, not their magnitude.
from math import sqrt

def cosine_similarity(x, y):
    numerator = 0   # dot product of x and y
    sum_x = 0       # squared norm of x
    sum_y = 0       # squared norm of y
    for a, b in zip(x, y):
        numerator += a * b
        sum_x += a ** 2
        sum_y += b ** 2
    denominator = sqrt(sum_x) * sqrt(sum_y)
    return numerator / denominator

print("Cosine Similarity:", round(cosine_similarity([1, 3, 4, 1], [3, 2, 1, 1]), 4))
OUTPUT:
Cosine Similarity: 0.6957
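To verify the earlier claim about b and c, we can reuse the function above (b and c here are illustrative vectors; c is simply b scaled by 2, so it points in the same direction):

b = [3, 2, 1, 1]
c = [6, 4, 2, 2]   # same direction as b, twice the magnitude

print("Cosine Similarity:", round(cosine_similarity(b, c), 4))   # 1.0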
JACCARD SIMILARITY AND DISTANCE:
Jaccard similarity works on sets rather than vectors and is used to find the similarity between two sets.
Jaccard similarity is defined as the size of the intersection of the sets divided by the size of their union.
The Jaccard similarity between two sets A and B is J(A, B) = |A ∩ B| / |A ∪ B|.
JACCARD DISTANCE:
We use Jaccard distance to find how dissimilar two sets are.
Subtracting the Jaccard similarity from 1 gives the Jaccard distance: jaccard_distance = 1 - jaccard_similarity.
def jaccard_similarity(x, y):
    # size of the intersection divided by the size of the union
    intersection = len(set(x).intersection(set(y)))
    union = len(set(x).union(set(y)))
    return intersection / union

print("Jaccard Similarity:", jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 4, 5, 7, 9]))
print("Jaccard Distance:", 1 - jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 4, 5, 7, 9]))
OUTPUT:
Jaccard Similarity: 0.3333333333333333
Jaccard Distance: 0.6666666666666667
SUMMARY:
In this post, I have discussed various distance measures in machine learning. Now the question is: which distance measure should you choose?
You should choose the distance measure based on the properties of your data.
Euclidean distance can be used when the input variables are similar in type or when you want the straight-line distance between two points.
In the case of high dimensional data, Manhattan distance is preferred over Euclidean.
The Hamming distance is used for categorical variables.
Cosine similarity can be used where the magnitude of the vector doesn’t matter. Both Jaccard and cosine similarity are often used in text mining.
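As a quick illustration of the text-mining use case, Jaccard similarity can compare two sentences as sets of words (the sentences below are made up for illustration):

def jaccard_similarity(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

# Treat each sentence as a set of words
s1 = "the quick brown fox".split()
s2 = "the lazy brown dog".split()
print("Jaccard Similarity:", jaccard_similarity(s1, s2))  # 2 shared words out of 6 total -> 0.3333...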
The code for this blog post can be found in this GitHub repo.