Say we are using a model to predict binary outcomes, positive or negative, for given inputs. Some examples are:
Predicting whether a patient has COVID-19: positive vs. negative
Predicting whether it will rain tomorrow: positive vs. negative
Predicting whether we will have an AGI system by 2030: positive vs. negative
Predicting the sentiment of the sentence “I love my cat.” (positive vs. negative)
Predicting whether a vision model classifies an image as cat vs. non-cat (positive vs. negative)
Note the following:
Even though the above setup assumes a prediction outcome, it does not imply that the model is necessarily assigning probabilities for the positive and negative outcomes. For example, we can have a model that just makes random guesses on the outcomes.
Usually, though, our models assign fixed logits (numbers assigned to the positive and negative outcomes). These logits are transformed into a probability distribution using some form of normalization (typically a softmax). Note that normalization is not required to make a prediction: one can simply select the outcome with the maximum logit.
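As a minimal sketch (Python, with made-up logits), softmax normalization and the argmax shortcut look like this:

```python
import math

# Illustrative logits for the two outcomes (made-up numbers).
logits = [2.0, 0.5]  # [positive, negative]

# Softmax turns the logits into a probability distribution.
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
print(probs)  # two probabilities summing to 1.0

# For a hard prediction, normalization is unnecessary: softmax is monotone,
# so the argmax of the logits equals the argmax of the probabilities.
print(logits.index(max(logits)) == probs.index(max(probs)))  # True
```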
We can evaluate such models with the accuracy metric: \[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}\].
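A minimal sketch of the accuracy computation in Python, with made-up labels (1 = positive, 0 = negative):

```python
# Accuracy: fraction of predictions that match the ground truth.
# Labels are illustrative; 1 = positive, 0 = negative.
y_true = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 7 correct out of 10 -> 0.7
```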
The accuracy metric is limited. If a model has an accuracy of 70%, we know it is right 70% of the time, but not how it performs on the positive and negative classes separately.
Confusion matrix
A toy model
We can describe the totality of predictions on both classes using the confusion matrix. The rows of the matrix represent the ground truth values of the two classes, and the columns represent the predicted values on them.
For example, say there are 100 patients, and our model makes predictions on whether they are infected with the COVID-19 virus. 50 patients were infected and the remaining 50 were not. Out of the 50 patients that were infected, the model predicted 30 of them correctly and out of the 50 remaining patients, the model predicted 40 of them correctly. Our confusion matrix is shown in Figure 1.
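The counts above can be assembled into a confusion matrix directly; a Python sketch (rows are ground truth, columns are predictions):

```python
# Confusion matrix for the toy COVID-19 model described above.
# 50 infected patients: 30 predicted positive (TP), 20 negative (FN).
# 50 healthy patients: 40 predicted negative (TN), 10 positive (FP).
TP, FN = 30, 20
TN, FP = 40, 10

confusion = [
    [TP, FN],  # ground-truth positive row
    [FP, TN],  # ground-truth negative row
]

# Accuracy is the diagonal sum over the total count.
accuracy = (TP + TN) / (TP + FN + FP + TN)
print(accuracy)  # (30 + 40) / 100 = 0.7
```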
Now consider a second model, whose confusion matrix is shown in Figure 2. We can make the following observations:
The second model has the same accuracy as the first model (70%).
The second model is correct 50% of the time on the ground-truth negative cases, while it is correct 90% of the time on the ground-truth positive cases.
The second model is better at classifying the positive cases than the first model (90% compared to 60%).
The second model has more false positives: in half of the cases where the outcome is negative, it claims that the patient is COVID-19 positive.
True positive rate and true negative rate
As observed above, it is revealing to consider the accuracy on the actual positive and negative outcomes separately. These are called the true positive rate (TPR) and the true negative rate (TNR), respectively. The true positive rate is also called recall (Kent et al. 1955). \[\text{TPR} := \frac{TP}{TP + FN}, \qquad \text{TNR} := \frac{TN}{TN + FP}\]
Here TP stands for true positives (actual positive and model predicts positive), and TN stands for true negatives (actual negative and model predicts negative). FP stands for false positives (actual negative and model predicts positive) and FN stands for false negatives (actual positive and model predicts negative).
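A sketch of these rates in Python. The first model's counts come from Figure 1; the second model's counts (45 TP, 5 FN, 25 TN, 25 FP) are inferred from the 90% / 50% rates quoted above:

```python
def tpr(TP, FN):
    # Recall / true positive rate: fraction of actual positives
    # that the model predicts positive.
    return TP / (TP + FN)

def tnr(TN, FP):
    # True negative rate: fraction of actual negatives
    # that the model predicts negative.
    return TN / (TN + FP)

# First model: TP=30, FN=20, TN=40, FP=10.
print(tpr(30, 20), tnr(40, 10))  # 0.6 0.8
# Second model: TP=45, FN=5, TN=25, FP=25.
print(tpr(45, 5), tnr(25, 25))   # 0.9 0.5
```

Both models reach 70% accuracy, but the rates expose how differently they treat the two classes.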
Now let us consider an imbalanced dataset. Consider a quality control (QC) department that detects defects in a manufactured item from the conveyor belt. The manufacturing uses highly precise instruments and hence defects are rare. The confusion matrix is shown in Figure 3.
Most items are non-defective (95) and the rest (5) are defective. The QC model correctly predicts 90 of the 95 non-defective items and 2 of the 5 defective items, giving an accuracy of 92%. But a trivial model that labels every item as non-defective would have an accuracy of 95% while being useless: its TPR is 0% and its TNR is 100%.
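A quick check of the trivial all-negative model in Python, using the counts above (defective = positive):

```python
# Trivial QC "model" that labels every item non-defective (negative).
# 95 non-defective items, 5 defective items.
TP, FN = 0, 5    # all 5 defective items are missed
TN, FP = 95, 0   # all 95 non-defective items are labeled correctly

accuracy = (TP + TN) / (TP + FN + TN + FP)
tpr = TP / (TP + FN)
tnr = TN / (TN + FP)
print(accuracy, tpr, tnr)  # 0.95 0.0 1.0
```

High accuracy on an imbalanced dataset can coexist with a TPR of zero, which is exactly what accuracy alone fails to reveal.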
Sometimes it helps to analyze the precision: \[\text{Precision} := \frac{TP}{TP + FP}\]
Here we ask: out of all the times the model predicts positive, how many times was it actually correct? Precision by itself is not a decisive metric. Precision and recall should both be considered at the same time.
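A sketch of precision alongside recall in Python, using the first toy model's counts (TP = 30, FP = 10, FN = 20):

```python
def precision(TP, FP):
    # Of all positive predictions, what fraction were actually positive?
    return TP / (TP + FP)

def recall(TP, FN):
    # Of all actual positives, what fraction were predicted positive?
    return TP / (TP + FN)

# First COVID-19 model: TP=30, FP=10, FN=20.
print(precision(30, 10))  # 30 / 40 = 0.75
print(recall(30, 20))     # 30 / 50 = 0.6
```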
F1 score
We have now seen two metrics that measure a model's performance on the positive class: recall and precision. Recall tells us what fraction of the ground-truth positive instances the model correctly classified as positive, while precision tells us what fraction of the instances predicted positive are actually positive. Note that recall is the ratio along the row of the ground-truth positive class, and precision is the ratio along the column of the predicted positive class.
For a perfect model, both recall and precision are 1.0, as the off-diagonal entries of the confusion matrix are zero. A model with high recall (close to 1.0) but low precision (close to 0.0) is good at flagging the positive class, but flags aggressively, producing relatively many false positives. On the other hand, a model with low recall but high precision flags conservatively (few false positives) but misses many of the ground-truth positive cases, as it has relatively many false negatives (positive instances classified negative).
We want a single measure that characterizes a model based on both precision and recall. It should be high when both are high and should penalize the model if either is low. The arithmetic mean does not behave this way. For example, if we travel to the moon at the speed of light (c) and return the same distance at 60 km/hr, the average speed is not \((c + 60)/2\) (which is huge, \(\sim 5.395 \times 10^8 \text{ km/hr}\)). Since the two legs cover equal distances, the actual average speed is the harmonic mean, \[\frac{2}{\frac{1}{c} + \frac{1}{60}} \approx 120 \text{ km/hr},\] which is dominated by the slower leg. The F1 score (Rijsbergen 1979) applies the same idea to our two metrics: it is the harmonic mean of precision and recall, \[F_1 := \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.\]
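A numeric sketch of the two means in Python, using the speed example and illustrative precision/recall values (0.75 and 0.6, the first toy model's numbers):

```python
# The harmonic mean is pulled toward the smaller of its two inputs,
# while the arithmetic mean is not.
def harmonic_mean(a, b):
    return 2 * a * b / (a + b)

c = 1.079e9  # speed of light in km/hr (approx.)
print((c + 60) / 2)          # arithmetic mean: ~5.4e8 km/hr, misleading
print(harmonic_mean(c, 60))  # ~120 km/hr, the true average speed

# F1 score: harmonic mean of precision and recall.
def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(0.75, 0.6))  # ~0.667 for the first toy model
```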
Questions
What is the vector space on which the confusion matrix acts? (diagonal good, non-diagonal bad)
Can there be a map from the space of model parameters to the space of confusion matrices?
References
Kent, Allen, Madeline M. Berry, Fred U. Luehrs Jr., and J. W. Perry. 1955. “Machine Literature Searching VIII. Operational Criteria for Designing Information Retrieval Systems.” American Documentation 6 (2): 93–101. https://doi.org/10.1002/asi.5090060209.
Rijsbergen, C. J. van. 1979. Information Retrieval. 2nd ed. London: Butterworths.