A box plot, also known as a box-and-whisker plot, helps us study the distribution of the data and spot outliers. It is a convenient way to visualize the spread and skew of the data.
It is created by plotting the five-number summary of the dataset: minimum, first quartile, median, third quartile, and maximum.
BOX PLOT:
The following diagram represents the box plot.
The first thing you might notice in the preceding diagram is a box that contains a horizontal line.
The box spans the two inner quartiles, from the first quartile to the third quartile, and contains the middle 50% of the data.
The horizontal line represents the median of the data.
If the median is not in the middle of the box, then the distribution is skewed.
The distribution is positively skewed if the median is closer to the bottom. If the median is closer to the top, then the distribution is negatively skewed.
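As a rough illustration of this rule, the sketch below compares the distance from the median to each end of the box, taking the ends to be the 25th and 75th percentiles. The sample values and the skew_direction helper are hypothetical names introduced only for this example, and the check is a simple heuristic rather than a formal test of skewness.

```python
import numpy as np

def skew_direction(values):
    """Guess the skew from where the median sits inside the box."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    lower_half = median - q1   # distance from the median to the bottom of the box
    upper_half = q3 - median   # distance from the median to the top of the box
    if np.isclose(lower_half, upper_half):
        return "symmetric"
    # Median closer to the bottom means the upper half is stretched out.
    return "positive" if lower_half < upper_half else "negative"

# Hypothetical samples, chosen only for illustration.
long_upper_tail = [1, 1, 2, 2, 3, 4, 6, 9, 15]
long_lower_tail = [1, 7, 10, 12, 13, 14, 14, 15, 15]

print(skew_direction(long_upper_tail))
print(skew_direction(long_lower_tail))
```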
The first quartile (Q1) is the 25th percentile of the data. It is also called the lower quartile.
The third quartile (Q3) is the 75th percentile of the data. It is also called the upper quartile.
Quartiles are a special case of quantiles, which are cut points that divide the sorted data into groups of equal size.
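To make these definitions concrete, here is a minimal sketch that computes the five-number summary with NumPy. The sample values and the five_number_summary helper are made up for illustration. Note that np.percentile uses linear interpolation by default, so other quartile conventions can give slightly different values.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 25], dtype=float)

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of a 1-D sample."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (float(np.min(values)), float(q1), float(median),
            float(q3), float(np.max(values)))

summary = five_number_summary(data)
print(summary)  # (2.0, 4.5, 7.0, 9.5, 25.0)
```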
The lines extending from both ends of the box are called whiskers, and they extend to the adjacent values.
The lower adjacent value is the most extreme data point that lies within 1.5 times the interquartile range (IQR) below the lower end of the box, and the upper adjacent value is the most extreme data point that lies within 1.5 times the IQR above the upper end of the box.
The interquartile range is calculated as IQR = Q₃ − Q₁.
Any data points beyond the ends of the whiskers are considered outliers and are drawn as circles or diamonds.
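The whisker rule can be sketched in a few lines of NumPy. The sample values and the whisker_stats helper are hypothetical, and the quartiles use NumPy's default linear interpolation, which is an assumption about how the plot computes them.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 25], dtype=float)

def whisker_stats(values, whis=1.5):
    """Return the IQR, the adjacent values, and the outliers of a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Limits beyond which a point is treated as an outlier.
    lower_limit = q1 - whis * iqr
    upper_limit = q3 + whis * iqr
    inside = values[(values >= lower_limit) & (values <= upper_limit)]
    outliers = values[(values < lower_limit) | (values > upper_limit)]
    return {
        "iqr": float(iqr),
        "lower_adjacent": float(inside.min()),  # end of the lower whisker
        "upper_adjacent": float(inside.max()),  # end of the upper whisker
        "outliers": outliers.tolist(),
    }

stats = whisker_stats(data)
print(stats)
```

Here Q1 is 4.5 and Q3 is 9.5, so the IQR is 5 and the limits are -3 and 17. The whiskers therefore end at 2 and 12, and 25 is flagged as an outlier.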
Let’s load the iris dataset and the necessary packages to begin with.
You can download the iris dataset using the following link.
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    # Assumes Iris.csv has been downloaded to the working directory
    iris = pd.read_csv('Iris.csv')
Now let’s see how to create a box-and-whisker plot.
    # Drop the Id column so that only the measurements are plotted
    sns.boxplot(data=iris.drop('Id', axis=1))
    plt.show()
We can also pass x and y values to draw one box per category.
    sns.boxplot(x="Species", y="SepalWidthCm", data=iris)
    plt.show()
We can also visualize many box plots in a single visualization. The following code produces box plots for individual features.
    features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

    fig, ax = plt.subplots(2, 2, figsize=(14, 6))

    # Draw one box plot per feature, each in its own subplot
    for feature, subplot in zip(features, ax.flatten()):
        sns.boxplot(x='Species', y=feature, data=iris, ax=subplot)

    fig.tight_layout()
    plt.show()
We can also overlay a swarm plot to show the data points on top of the box.
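    features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
    fig, ax = plt.subplots(2, 2, figsize=(14, 6))

    for feature, subplot in zip(features, ax.flatten()):
        sns.boxplot(x='Species', y=feature, data=iris, ax=subplot)
        # Overlay the individual observations on top of each box
        sns.swarmplot(x='Species', y=feature, data=iris, ax=subplot)

    fig.tight_layout()
    plt.show()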
By setting the notch argument to True, we can create a notched box plot. The notch around the median represents a confidence interval for the median.
    sns.boxplot(x="Species", y="SepalWidthCm", data=iris, notch=True)
    plt.show()
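To give a feel for how wide a notch is, the sketch below uses the approximation median ± 1.57 × IQR / √n. As far as we know, this is what Matplotlib applies by default when bootstrapping is not requested, but treat that as an assumption. The sample values and the notch_interval helper are hypothetical.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 25], dtype=float)

def notch_interval(values):
    """Approximate the notch limits as median +/- 1.57 * IQR / sqrt(n)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(values.size)
    return float(median - half_width), float(median + half_width)

low, high = notch_interval(data)
print(f"notch spans {low:.2f} to {high:.2f}")
```

A common reading is that when the notches of two boxes do not overlap, their medians probably differ.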
If you want to hide the outliers in the plot, you can set the showfliers argument to False.
    sns.boxplot(x='Species', y='SepalWidthCm', data=iris, showfliers=False)
    plt.show()
Another argument worth mentioning is whis, which changes the reach of the whiskers.
By default, the whiskers extend up to 1.5 * IQR beyond the box. By setting whis=2, we can make them extend up to 2 * IQR.
    sns.boxplot(x='Species', y='SepalWidthCm', data=iris, whis=2)
    plt.show()
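A larger whis value means fewer points are drawn as outliers. The following sketch counts the flagged points for two settings. The sample values and the count_outliers helper are hypothetical, and the quartiles are computed with NumPy's default interpolation.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 18], dtype=float)

def count_outliers(values, whis):
    """Count the points lying more than whis * IQR beyond the box."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    reach = whis * (q3 - q1)
    return int(np.sum((values < q1 - reach) | (values > q3 + reach)))

default_count = count_outliers(data, whis=1.5)  # upper limit is 17, so 18 is flagged
wider_count = count_outliers(data, whis=2)      # upper limit is 19.5, so nothing is flagged
print(default_count, wider_count)
```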
The following code draws the whiskers at the 10th and 80th percentiles of the data instead.
    sns.boxplot(x='Species', y='SepalWidthCm', data=iris, whis=[10, 80])
    plt.show()
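To see what percentile-based whiskers mean numerically, here is a small sketch. It follows our understanding of how a pair of values passed to whis is handled, which should be treated as an assumption. The percentiles set the limits, each whisker ends at the most extreme data point inside them, and everything outside is drawn as an outlier. The sample values and the percentile_whiskers helper are hypothetical.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 25], dtype=float)

def percentile_whiskers(values, low_pct, high_pct):
    """Return the whisker ends and the points outside the given percentiles."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [low_pct, high_pct])
    inside = values[(values >= low) & (values <= high)]
    outside = values[(values < low) | (values > high)]
    return float(inside.min()), float(inside.max()), outside.tolist()

lower_end, upper_end, outside = percentile_whiskers(data, 10, 80)
print(lower_end, upper_end, outside)
```

With these values the whiskers end at 4 and 10, and the points 2, 12, and 25 fall outside them.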
SUMMARY:
In this tutorial, we discussed how to interpret the box plot.
It is built on the five-number summary: the minimum, first quartile, median, third quartile, and maximum.
In a box plot, we draw a box between the first and the third quartile. The horizontal line that goes through the box is the median. The whiskers extend from the ends of the box to the adjacent values, which are the minimum and maximum when the data has no outliers.
Any data points beyond the ends of the whiskers are considered outliers.