Accuracy, Specificity, Precision, Recall, and F1 Score for Model Selection

You must have heard about accuracy, specificity, precision, recall, and the F score, since they are used extensively to evaluate machine learning models. You may also have come across two specific types of errors called "type 1" and "type 2" errors. In this post, we will cover all these metrics one by one.
To understand them and their importance, we need to go a little deeper and talk about all four possible outcomes of our everyday classification models:
Let’s say you have built the famous “cat vs dog” supervised model, the machine learning equivalent of “hello world” in a programming language. Allow me to take a guess and assume that you are a cat person, or have become one after watching so many cute videos of them online, like me. Therefore, you set the aim of this machine learning project as finding the cat images in the mixed set.
Let me repeat: you want your model to find cat images for you. Since you have only 2 classes/categories before you, the cat class becomes the positive class and the dog class becomes the negative class automatically. Once you run the model on thousands of images, the model is going to give you good, bad, and ugly results. We must look at these results and evaluate our model so that it can be tuned further and improved. So far, so good, right?
Confusion Matrix
We can start by dividing the results as below, where we have the original values or the “actual values” on the Y axis, and the results produced by the model or “predicted values” on the X axis. By looking at the overall results, anyone can see that the model is not always predicting correctly. It’s marking some cats as dogs and vice versa. Our model is definitely confused, and this confusion is nicely exposed by the image below. All we have to do is fill up this image with the corresponding numbers and we would have our confusion matrix.

Have a look at the above image again. The class labels on both axes, i.e. X and Y, flow from positive to negative when you start from the top left. A lot of times you will see people having this arrangement a little mixed up, so you really need to keep it in mind while dealing with a confusion matrix. Now, we need numbers for 4 cases/possibilities to get our confusion matrix. These cases are:
True Positive (TP)
Correctly predicted positive class. This is where our model comes out as a hero and calls a positive sample, well, a positive sample. This is where it successfully calls our cat a cat. We want our model to do more of this, right? It occupies the top left of the confusion matrix (at least in our case).
Examples:
- Our model correctly labelling its positive class, i.e. cats, as cats
- A Face detection model successfully identifies the correct user
- The shepherd shouting “wolf” when there is an actual wolf
False Positive (FP)
Did you notice that our model marked some dogs as cats? If not, then check out the bottom left side. Imagine a smoke alarm going off even when there is no fire. Why? Because its algorithm “predicted fire” even though there was “no actual fire”. Remember the story of the shepherd and the wolf, where the shepherd used to “shout wolf” all the time even when there was “no actual wolf”? Well, we know how he ended up, and we definitely don’t want our model to behave in the same fashion. Such results are called false positives, and this type of error is called a Type 1 error.
Examples:
- A smoke alarm going off even when there is no fire
- The shepherd shouting “wolf” even when there was no wolf around
- A Covid RT-PCR test marks you positive even when you don’t have Covid. This test had a significant false positive rate of 5% according to this paper on the impact of false positive COVID-19 results in an area of low prevalence.
- A recommendation system displaying an irrelevant result
True Negative (TN)
Check out the bottom right case. Even though our model was built keeping cats in mind, it also predicts dogs, its negative class, successfully. We want more of this from our model, right? Well, yes, it depends, but we definitely want our model to yield as many true positives as it can.
Examples:
- Our model correctly labels dogs as dogs
- The shepherd does not shout “wolf” when there is no wolf
- A pregnancy test machine correctly calls a woman not pregnant when she is not.
False Negative (FN)
Check out the top right image. The model marked an image of an actual cat as a dog. This error is called a “Type 2” error or false negative.
Examples:
- Our model marking cats as dogs
- A cancer detection machine marking an actual cancer patient as not having cancer
- A Covid test marking an actual Covid patient as negative

We have covered all 4 types of results. Can you think of more examples of the above cases? If yes, please write them down in the comment section for everyone else.
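To make these four outcomes concrete, here is a minimal sketch, assuming scikit-learn is installed and using a handful of made-up cat/dog labels, that builds the confusion matrix and pulls out TP, FN, FP, and TN:

```python
# A minimal sketch of the four outcomes, using made-up cat/dog labels.
# Requires scikit-learn: pip install scikit-learn
from sklearn.metrics import confusion_matrix

# "cat" is our positive class, "dog" the negative class.
actual    = ["cat", "cat", "cat", "dog", "dog", "dog", "cat", "dog"]
predicted = ["cat", "dog", "cat", "dog", "cat", "dog", "cat", "dog"]

# labels=["cat", "dog"] fixes the row/column order: positive class first,
# matching the layout described above (actual on rows, predicted on columns).
cm = confusion_matrix(actual, predicted, labels=["cat", "dog"])
tp, fn = cm[0]   # actual cats: correctly found (TP), missed as dogs (FN)
fp, tn = cm[1]   # actual dogs: wrongly called cats (FP), correctly called dogs (TN)

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
# TP=3, FN=1, FP=1, TN=3
```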
Type 1 vs Type 2 Errors
What do you think is more tolerable, i.e. less risky?
A cancer detection machine indicating that a person has cancer even when they don’t? OR indicating that a person doesn’t have cancer even when they do?
A smoke alarm going off when there is no fire? Or the same alarm staying silent even when there is a fire?
Let me put it in a different way: if you were to design a Covid-detection kit, you would have 2 options before you: flag more test-takers as positive even when they might not have Covid, OR miss actual Covid patients, either to avoid flagging them unnecessarily or to comply with a government that has asked you to keep the numbers low for the sake of its image (read the article). I am sure you can understand the cost of both errors here. The cost of a type 2 error is very high for machines that detect serious illnesses, so they must keep it as low as possible. This leads to an increase in type 1 error, and that issue is taken care of by training the assistants, nurses, and doctors to read the output reports thoroughly.
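To see the trade-off in action, here is a small hypothetical sketch: the scores and labels below are made up, and the point is simply that lowering the decision threshold removes false negatives (type 2 errors) at the price of extra false positives (type 1 errors):

```python
# Hypothetical illustration: lowering the decision threshold trades
# type 2 errors (false negatives) for type 1 errors (false positives).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]  # model's "has disease" probability
actual = [1,    1,    1,    1,    0,    0,    0,    0]      # 1 = actually sick

def count_errors(threshold):
    predicted = [1 if s >= threshold else 0 for s in scores]
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # missed patients
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false alarms
    return fn, fp

for t in (0.5, 0.25):
    fn, fp = count_errors(t)
    print(f"threshold={t}: false negatives={fn}, false positives={fp}")
# threshold=0.5:  false negatives=1, false positives=0
# threshold=0.25: false negatives=0, false positives=1
```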

Congratulations! We have covered the building blocks. Let’s go through the metrics.

Accuracy
Overall, how often is the classifier correct? Accuracy is one of the most common metrics for classification problems. It tells you how regularly the model predicts the correct output, and it is measured as the ratio of the number of correct predictions made by the classifier to the total number of predictions made. In terms of the confusion matrix, it is given by:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
This is quite straightforward, right? The same formula can be rephrased as:
Accuracy = Number of correct predictions / Total number of predictions
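As a quick sanity check, here is a tiny sketch with made-up counts (the same ones used in the cat/dog sketch above):

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN), using made-up counts.
tp, tn, fp, fn = 3, 3, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy = {accuracy:.2f}")   # 0.75, i.e. 6 correct out of 8 predictions
```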
Sensitivity vs Specificity
Since there are a lot of variants of their definitions, have a look at this one from the New York State Department of Health:
“sensitivity and specificity are measures of a test’s ability to correctly classify a person as having a disease or not having a disease. Sensitivity refers to a test’s ability to designate an individual with the disease as positive. A highly sensitive test means that there are few false negative results, and thus fewer cases of the disease are missed. The specificity of a test is its ability to designate an individual who does not have a disease as negative. A highly specific test means that there are few false positive results. It may not be feasible to use a test with low specificity for screening, since many people without the disease will screen positive, and potentially receive unnecessary diagnostic procedures.
It is desirable to have a test that is both highly sensitive and highly specific. This is frequently not possible. Typically there is a trade-off. For many clinical tests, there are some people who are clearly normal, some clearly abnormal, and some that fall into the gray area between the two. Choices must be made in establishing the test criteria for positive and negative results.”
I hope the difference is crystal clear now. Let’s look at their formulas:
Specificity (True negative rate)
Specificity = TN / (TN + FP)
Recall or Sensitivity
Recall or sensitivity refers to the fraction of relevant items that an AI search returns out of the total number of relevant items in the original population. If there are 18 relevant documents in the whole population and the search returns 9 relevant items, recall is 50%.
Recall tells you how well a search finds relevant items.
Recall (Sensitivity) = TP / (TP + FN)
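Since sensitivity and specificity both come straight out of the confusion matrix, here is a short sketch using made-up counts for a hypothetical screening test:

```python
# Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP),
# with made-up counts for a hypothetical disease-screening test.
tp, fn = 90, 10    # 100 people actually have the disease
tn, fp = 950, 50   # 1000 people do not

sensitivity = tp / (tp + fn)   # how many sick people the test catches
specificity = tn / (tn + fp)   # how many healthy people it correctly clears

print(f"Sensitivity (recall) = {sensitivity:.2%}")  # 90.00%
print(f"Specificity          = {specificity:.2%}")  # 95.00%
```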
Precision
Precision refers to the fraction of the items returned by a search that are actually relevant. If a search returns 12 items, 9 of which are relevant and 3 irrelevant, the precision is 75%.
Precision tells you how well a search avoids false positives. Both precision and recall are important to the success of a search.
Precision = TP / (TP + FP)
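Using the document-search numbers from the two sections above (18 relevant documents in total, 12 items returned, 9 of them relevant), a quick sketch:

```python
# Document-search example: 18 relevant docs exist, the search returns 12 items,
# 9 of which are relevant.
tp = 9            # relevant items returned
fp = 12 - 9       # irrelevant items returned
fn = 18 - 9       # relevant items the search missed

recall = tp / (tp + fn)        # 9 / 18
precision = tp / (tp + fp)     # 9 / 12

print(f"Recall    = {recall:.0%}")     # 50%
print(f"Precision = {precision:.0%}")  # 75%
```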
F Score
Alright, so we see there are quite a few metrics. Which one should we chase? It’s quite difficult to chase more than one metric. One solution to this problem is to create a combined formula of precision and recall and get a single score. This score is called the F-score:
Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
Since different machine learning problems assign different importance to precision and recall, we can adjust beta accordingly: beta greater than 1 weighs recall more heavily, while beta less than 1 favours precision.
F1 Score
One of the most famous variants of the F score is the F1 score, with the value of beta being 1. In this case, precision and recall get equal importance. Once you substitute beta = 1, the formula reduces to the harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score, or simply F score, is used heavily in machine learning as a measurement of a model’s accuracy, especially in binary classification systems. It is also commonly used for evaluating information retrieval systems such as search engines, and in natural language processing.
It is possible to adjust the F-score to give more importance to precision over recall, or vice-versa. Common adjusted F-scores are the F0.5-score and the F2-score, as well as the standard F1-score.
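Here is a small sketch of the general F-beta formula evaluated at beta = 0.5, 1, and 2, so you can see how the weighting shifts between precision and recall (scikit-learn’s fbeta_score computes the same thing directly from labels):

```python
# F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
def f_beta(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.75, 0.50  # the search-engine example above
for beta in (0.5, 1, 2):
    print(f"F{beta} = {f_beta(precision, recall, beta):.3f}")
# F0.5 = 0.682  (leans towards precision)
# F1   = 0.600  (harmonic mean)
# F2   = 0.536  (leans towards recall)
```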
Calculation of F-score
Let’s work through an example. Imagine we have a tree with ten apples on it. Seven are ripe and three are still unripe, but we do not know which is which. We have an AI which is trained to recognize which apples are ripe for picking; it should pick all the ripe apples and no unripe apples. We would like to calculate the F-score, and we consider both precision and recall to be equally important, so we will set β to 1 and use the F1 score.
The AI picks five ripe apples but also picks one unripe apple. We can represent the true and false positives and negatives in a confusion matrix as follows:
Confusion matrix: TP = 5 (ripe apples picked), FN = 2 (ripe apples left on the tree), FP = 1 (unripe apple picked), TN = 2 (unripe apples left on the tree).
The model’s precision is the number of ripe apples that were correctly picked, divided by all apples that the model picked. The recall is the number of ripe apples that were correctly picked, divided by the total number of ripe apples.
That gives Precision = 5/6 ≈ 0.83, Recall = 5/7 ≈ 0.71, and an F1 score of 2 × (0.83 × 0.71) / (0.83 + 0.71) ≈ 0.77.
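Putting numbers on the apple example (7 ripe apples, 3 unripe; the AI picks 5 ripe and 1 unripe), a short sketch using scikit-learn confirms the values above:

```python
# The apple-picking example: 7 ripe (positive) and 3 unripe (negative) apples.
# The AI picks 5 ripe apples (TP) and 1 unripe apple (FP); it leaves 2 ripe
# apples on the tree (FN) and 2 unripe ones (TN).
from sklearn.metrics import precision_score, recall_score, f1_score

actual    = [1]*7 + [0]*3                  # 1 = ripe, 0 = unripe
predicted = [1]*5 + [0]*2 + [1] + [0]*2    # picks 5 ripe, misses 2, picks 1 unripe

print(f"Precision = {precision_score(actual, predicted):.3f}")  # 5/6 ≈ 0.833
print(f"Recall    = {recall_score(actual, predicted):.3f}")     # 5/7 ≈ 0.714
print(f"F1 score  = {f1_score(actual, predicted):.3f}")         # ≈ 0.769
```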