F1 Score in Machine Learning

 

Understanding the F1 Score in Machine Learning

The F1 score is a measure of a model’s performance that takes both precision and recall into account. It is the harmonic mean of the two, giving a balanced view of performance, especially for binary classification tasks.

Precision and Recall

Precision is the proportion of positive predictions that are actually correct:

Precision = True Positives / (True Positives + False Positives)

Recall (also known as Sensitivity or True Positive Rate) is the proportion of actual positives that are correctly predicted:

Recall = True Positives / (True Positives + False Negatives)
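
As a quick check on these two formulas, here is a minimal Python sketch that computes precision and recall from hypothetical counts of true positives, false positives, and false negatives (the counts are illustrative, not taken from a real model):

```python
def precision(tp: int, fp: int) -> float:
    """Proportion of positive predictions that are actually correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Proportion of actual positives that are correctly predicted."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Illustrative counts (hypothetical)
tp, fp, fn = 40, 10, 20
print(f"Precision: {precision(tp, fp):.2f}")  # 40 / (40 + 10) = 0.80
print(f"Recall:    {recall(tp, fn):.2f}")     # 40 / (40 + 20) = 0.67
```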

The F1 Score

The F1 Score is the harmonic mean of precision and recall, where higher values indicate a better balance between the two metrics:

F1 = 2 * (Precision * Recall) / (Precision + Recall)
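
In practice you rarely compute this by hand; a library such as scikit-learn provides it directly. The sketch below (assuming scikit-learn is installed, with made-up labels) shows that the formula and the library helper agree:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary labels (1 = positive class, 0 = negative class)
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# F1 computed two ways: from the formula and with scikit-learn's helper
f1_manual = 2 * (p * r) / (p + r)
f1_sklearn = f1_score(y_true, y_pred)

print(f"Precision: {p:.2f}, Recall: {r:.2f}")   # 0.75, 0.60
print(f"F1 (manual):  {f1_manual:.2f}")         # 0.67
print(f"F1 (sklearn): {f1_sklearn:.2f}")        # 0.67
```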

Why is F1 Score More Useful than Accuracy in Class Imbalance Datasets?

Accuracy is the most basic metric, calculated as the ratio of correct predictions (true positives + true negatives) to the total number of predictions. It works well when the classes are balanced, but it can be misleading on imbalanced datasets.

Class Imbalance Example:

Suppose you have a binary classification problem where 95% of the data belongs to Class A (majority class) and 5% belongs to Class B (minority class).

  • If a model predicts all samples as Class A, it would still have an accuracy of 95%, even though it’s completely ignoring Class B.
  • This high accuracy score hides the fact that the model is not identifying any of the minority class, which could be the class you care about more (the short sketch below makes this concrete).
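
Here is a minimal simulation of the 95/5 split above (assuming scikit-learn is installed), using a degenerate model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced dataset: 95% Class A (0), 5% Class B (1)
y_true = [0] * 950 + [1] * 50

# A degenerate model that predicts the majority class for every sample
y_pred = [0] * 1000

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")                  # 95.00%
print(f"F1 (Class B): {f1_score(y_true, y_pred, zero_division=0):.2f}")   # 0.00
```

Accuracy looks excellent at 95%, yet the F1 score for the minority class is 0 because the model never predicts Class B.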

Why F1 Score is Better:

The F1 score addresses the imbalance problem by focusing on both precision and recall, giving you a better measure of how well the model is identifying the minority class.

  • Low precision means that many of the positive predictions are incorrect (e.g., Class A samples falsely labeled as Class B).
  • Low recall means that the model misses many of the actual positive instances (e.g., fails to detect Class B).

The F1 score gives a more balanced view because the harmonic mean is dominated by the smaller of the two values: a model cannot make up for very low recall with very high precision (or vice versa), so both false positives and false negatives are taken into account.
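
As a quick illustration of that penalty, consider a hypothetical model with precision 1.0 but recall 0.1. The arithmetic mean would suggest a middling score of 0.55, while the F1 score stays close to the weaker metric:

```python
precision, recall = 1.0, 0.1  # hypothetical values

arithmetic_mean = (precision + recall) / 2
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # 0.55
print(f"F1 score:        {f1:.2f}")               # 0.18
```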

Example of Class Imbalance:

Let’s say you are building a model to detect fraud (Class B) in a dataset with 100,000 transactions, where only 500 are fraudulent:

  • Accuracy could be 99.5% by just predicting every transaction as non-fraud, but this model would be useless because it detects no fraud.
  • The F1 score would be 0 in this case: with no transactions flagged as fraud, recall is 0 (and precision is undefined, conventionally treated as 0), reflecting the model’s inability to identify fraud.
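
The fraud example can be reproduced in a few lines (again assuming scikit-learn; the labels are synthetic):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Synthetic fraud dataset: 100,000 transactions, 500 fraudulent (1 = fraud)
y_true = np.array([0] * 99_500 + [1] * 500)

# A model that flags nothing as fraud
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3%}")                  # 99.500%
print(f"Recall:   {recall_score(y_true, y_pred, zero_division=0):.2f}")   # 0.00
print(f"F1:       {f1_score(y_true, y_pred, zero_division=0):.2f}")       # 0.00
```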

Conclusion

For class-imbalanced datasets, the F1 score provides a better evaluation of the model’s performance by balancing precision and recall, especially when one class is significantly underrepresented. The accuracy score can be misleading in these situations, often overestimating a model’s performance.

 

