Why do we need performance measures?
Why do we need performance measures at all?
After developing a classification model, we need to assess its performance.
Obviously, we can use accuracy as a metric to evaluate the model performance.
However, is accuracy alone enough to assess the model's performance, or do we need other performance measures as well?
Let’s find out.
In this post, we’ll discuss why accuracy alone can mislead us into thinking our model is performing well, and look at other performance measures such as the confusion matrix, precision, recall, and the F1 score.
Accuracy is one type of metric or measure to assess the performance of our model. It is the most common and intuitive of all performance measures.
Accuracy is simply the percentage of correct predictions.
Accuracy ranges from 0 to 1; the higher the number, the better.
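As a quick sketch of the definition, accuracy can be computed by hand (the toy labels below are invented purely for illustration):

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```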
Now, can we say that the higher the accuracy, the better our model performs?
The answer is No.
Accuracy is a great measure only when we have a balanced dataset. On an imbalanced dataset, even a dumb model can achieve high accuracy.
Suppose that we have a dataset where there are a total of 1000 points out of which 950 points belong to the positive class, and 50 points belong to the negative class.
Assume we have a dumb model that has no logic and simply declares every point as positive.
If we run this model on our dataset, we’ll get an accuracy of 95%.
Does this mean our model is performing well?
No. Our model is not doing anything appreciable. It is just declaring all the points to be positive.
This problem is called the accuracy paradox: on a highly imbalanced dataset, even a dumb model achieves high accuracy.
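The paradox is easy to reproduce; here is a minimal sketch using the 950/50 split described above (the labels and the all-positive "model" are illustrative):

```python
from sklearn.metrics import accuracy_score

# Imbalanced dataset: 950 positive (1) and 50 negative (0) points
y_true = [1] * 950 + [0] * 50

# A dumb "model" that declares every point positive
y_pred = [1] * 1000

print(accuracy_score(y_true, y_pred))  # 0.95
```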
The confusion or error matrix is a square matrix which is used to define the performance of a classifier. It reports the number of correct and incorrect predictions of a classifier.
As we saw earlier, we cannot depend on classification accuracy alone, as it can mislead us when the dataset is imbalanced.
The following diagram is the representation of the confusion matrix.
Let’s understand what’s written inside each of these cells.
TRUE NEGATIVE (TN):
These are the negative points correctly classified as negative.
FALSE POSITIVE (FP):
These are the negative points incorrectly predicted as positive. FP is also called a TYPE-1 error: we predict that something will happen, but it doesn’t.
FALSE NEGATIVE (FN):
These are the positive points incorrectly predicted as negative. FN is also called a TYPE-2 error: we predict that something won’t happen, but it does.
TRUE POSITIVE (TP):
These are the positive points correctly classified as positive.
If we add TP and FN, we get the total number of actual positive points.
Similarly, to get the total number of actual negative points, we add TN and FP.
We always want the elements on the principal diagonal to be large.
That is, TN and TP should be large, because these are the correct predictions.
Moreover, the off-diagonal values (FP and FN) should be small, as these are the incorrect predictions.
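In scikit-learn’s convention the rows are the actual classes and the columns are the predicted classes, so for binary labels {0, 1} the cells come out as [[TN, FP], [FN, TP]]. A tiny sketch with invented labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]  # actual labels
y_pred = [0, 1, 1, 1, 0, 0]  # predicted labels

# ravel() flattens the 2x2 matrix into TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```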
Now let’s take an example and understand all of these terminologies more clearly.
Assume that we have to predict whether a customer purchases a product or not.
Fortunately, sklearn provides an inbuilt function to compute the confusion matrix, so let’s use that.
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
This is the confusion matrix for our classification model
Here, True Negatives are 64: customers who did not purchase the product and were also predicted as not purchasing.
False Positives are 4: customers who did not purchase the product but were classified as purchasing.
False Negatives are 3: customers who purchased the product but were classified as not purchasing.
True Positives are 29: customers who purchased the product and were also predicted as purchasing.
Now that we know what a confusion matrix is, let’s look at a few more performance measures built on top of it.
PRECISION:
What proportion of points are actually positive, out of all the points predicted to be positive?
Out of all the customers our model predicted would purchase the product, what proportion of them actually purchased it?
Precision ranges from 0 to 1, where 1 is best.
TRUE POSITIVE RATE OR RECALL:
Of all the actual positive points, how many are predicted to be positive?
Recall tells us, out of all the customers who purchased the product, how many our model predicted as purchasing.
It is calculated as the number of true positives divided by the total number of actual positives.
TRUE NEGATIVE RATE OR SPECIFICITY:
Calculated as the number of true negatives divided by the total number of actual negatives.
Specificity tells us how many of the customers who did not purchase the product were predicted as not purchasing.
TNR ranges from 0 to 1, where 1 is best.
FALSE NEGATIVE RATE:
Calculated as the number of incorrect negative predictions divided by the total number of positives.
FALSE POSITIVE RATE:
Calculated as the number of incorrect positive predictions divided by the total number of negatives.
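Using the confusion matrix values from our purchase example (TN = 64, FP = 4, FN = 3, TP = 29), all four rates can be computed directly from their definitions:

```python
# Confusion matrix values from the purchase example above
tn, fp, fn, tp = 64, 4, 3, 29

tpr = tp / (tp + fn)  # recall (sensitivity): 29 / 32
tnr = tn / (tn + fp)  # specificity: 64 / 68
fpr = fp / (fp + tn)  # 1 - specificity: 4 / 68
fnr = fn / (fn + tp)  # 1 - recall: 3 / 32

print(f'TPR={tpr:.3f} TNR={tnr:.3f} FPR={fpr:.3f} FNR={fnr:.3f}')
```

Note that TPR + FNR = 1 and TNR + FPR = 1, so the four rates come in complementary pairs.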
F1 SCORE:
The F1 score combines precision and recall into a single metric by taking their harmonic mean.
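With the confusion matrix values above, precision is 29/33 and recall is 29/32, so the harmonic mean works out as follows:

```python
precision = 29 / (29 + 4)   # TP / (TP + FP)
recall    = 29 / (29 + 3)   # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.892
```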
Let’s calculate precision, recall, and F1 score for our example.
We’ll use sklearn to calculate these three metrics.
from sklearn.metrics import precision_score, recall_score, f1_score
print('Precision: %.3f' % precision_score(y_test, y_pred))
print('Recall: %.3f' % recall_score(y_test, y_pred))
print('F1: %.3f' % f1_score(y_test, y_pred))
This is the output we got:
Earlier, we saw that accuracy is not reliable for an imbalanced dataset.
Let’s see how we can use the confusion matrix along with the four rates TPR, TNR, FPR, and FNR to tackle that problem.
Remember, a good model always has high TPR and TNR rates and low FPR and FNR rates.
Let’s revisit the example of the imbalanced dataset with 950 points belonging to the positive class and 50 points belonging to the negative class.
If we run a dumb model that predicts every point as positive on this dataset, the confusion matrix looks like this.
As you can see, all the points are predicted as positive.
Now we calculate all the four rates TPR, TNR, FPR, FNR.
TPR = 950 / 950 = 100%
FNR = 0 / 950 = 0%
TNR = 0 / 50 = 0%
FPR = 50 / 50 = 100%
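These rates can also be checked in code; a minimal sketch of the dumb model on the imbalanced dataset:

```python
from sklearn.metrics import confusion_matrix

y_true = [1] * 950 + [0] * 50  # imbalanced dataset
y_pred = [1] * 1000            # dumb model: everything positive

# With labels {0, 1}, ravel() gives the cells as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn), tn / (tn + fp))  # TPR = 1.0, TNR = 0.0
```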
Seeing these rates, we can tell that our model is doing something seriously wrong, because a good model always has high TPR and TNR and low FPR and FNR.
So, even for an imbalanced dataset, the confusion matrix gives us a realistic idea of how our model is performing.
So far, we have discussed performance measures like accuracy, confusion matrix, precision, recall, and f1 score.
We also saw how accuracy alone is not enough to assess the performance of a model, since it is unreliable for imbalanced datasets.
We walked through the confusion matrix with a simple classification example and saw how to calculate TPR, TNR, FPR, and FNR.
In the next part, we’ll learn about performance metrics like log loss, ROC, and AUC curves.