Confusion Matrix Calculator

Compute accuracy, precision, recall, F1, MCC, kappa, and 15+ classification metrics from TP/FP/FN/TN values with visual matrix and performance bars.

About the Confusion Matrix Calculator

A confusion matrix is the foundational tool for evaluating binary classification models. It breaks down predictions into four categories: True Positives (correctly identified positives), False Positives (incorrectly flagged as positive), False Negatives (missed positives), and True Negatives (correctly identified negatives). From these four numbers, a wealth of performance metrics can be derived.

This calculator computes over 15 classification metrics including accuracy, precision, recall (sensitivity), specificity, F1 score, Matthews Correlation Coefficient (MCC), Cohen's kappa, balanced accuracy, Youden's J index, likelihood ratios, F0.5, F2, and more. Each metric captures a different aspect of classifier performance, and no single metric tells the whole story.

Understanding when to prioritize which metric is crucial. In medical diagnosis, you want high recall (don't miss sick patients) even at the cost of some false positives. In spam filtering, high precision matters (don't send real email to spam). The presets demonstrate common scenarios across medical testing, spam filtering, fraud detection, and image recognition.

Why Use This Confusion Matrix Calculator?

Evaluating classification models requires more than just accuracy. This calculator provides a comprehensive dashboard of 15+ metrics from four simple inputs, saving time and reducing calculation errors. The visual confusion matrix and performance bars make it easy to spot strengths and weaknesses at a glance.

It's invaluable for machine learning practitioners, medical researchers evaluating diagnostic tests, students learning classification evaluation, and anyone who needs to communicate model performance clearly. The presets covering different domains help build intuition about which metrics matter in which context.

How to Use This Calculator

  1. Enter the four confusion matrix values: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN).
  2. Use presets for common scenarios like medical testing, spam filtering, or fraud detection.
  3. Review the color-coded confusion matrix visualization showing the four quadrants.
  4. Examine the primary output cards: accuracy, precision, recall, specificity, F1, and MCC.
  5. Check the extended metrics table for comprehensive evaluation including kappa and likelihood ratios.
  6. Compare metrics visually using the performance bars at the bottom.

Formula

Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 = 2 × Precision × Recall / (Precision + Recall)
MCC = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
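As a sketch, the formulas above can be computed in a few lines of Python (the function name and return structure are illustrative, not the calculator's actual code):

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Derive core metrics from the four confusion matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denom,
    }
```

With the worked example's counts (TP = 90, FP = 10, FN = 5, TN = 895), this returns accuracy 0.985, precision 0.90, recall ≈ 0.947, and F1 ≈ 0.923.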

Example Calculation

Inputs: TP = 90, FP = 10, FN = 5, TN = 895. Result: Accuracy: 98.5%, Precision: 90%, Recall: 94.7%, F1: 0.923

A medical test correctly identifies 90 of 95 positive cases (94.7% recall) with only 10 false alarms among 905 negatives (98.9% specificity). The F1 score of 0.923 reflects strong overall positive-class performance.

Tips & Best Practices

Understanding the Confusion Matrix

The confusion matrix is organized with actual classes as rows and predicted classes as columns (though conventions vary). True Positives and True Negatives sit on the main diagonal — these are correct predictions. False Positives (type I errors) and False Negatives (type II errors) are the off-diagonal cells representing mistakes.

Every classification metric derives from these four counts. Accuracy uses all four; precision and recall focus on the positive class; specificity focuses on the negative class. The choice of which metric to optimize depends on the costs of different types of errors.

The Precision-Recall Trade-off

Precision and recall typically trade off against each other in practice. Making a classifier more conservative (requiring stronger evidence for positive predictions) increases precision but decreases recall — fewer false positives, but more missed positives. Conversely, a more liberal threshold catches more positives (higher recall) at the cost of more false alarms (lower precision). The F1 score is the harmonic mean of the two, penalizing extreme imbalances between them.
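The trade-off can be seen by sweeping a decision threshold over classifier scores; the scores and labels below are made-up toy data:

```python
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]  # 1 = actual positive

def precision_recall(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.25, 0.50, 0.75):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold from 0.25 to 0.75 moves precision from 0.62 up to 1.00 while recall falls from 1.00 to 0.60.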

Matthews Correlation Coefficient

MCC, introduced by biochemist Brian Matthews in 1975, is increasingly recognized as the most informative single metric for binary classification. Unlike accuracy, it accounts for all four quadrants. Unlike F1, it doesn't ignore true negatives. An MCC of 0 indicates random prediction; +1 is perfect; −1 is total disagreement. Several studies have shown MCC to be the most reliable metric when classes are imbalanced.
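A quick sanity check of the MCC range, using a hypothetical helper (the convention of returning 0 when the denominator is zero is common but not universal):

```python
def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from the four confusion counts."""
    num = tp * tn - fp * fn
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / denom if denom else 0.0  # convention: undefined MCC -> 0

print(mcc(50, 0, 0, 50))    # perfect agreement: +1.0
print(mcc(0, 50, 50, 0))    # total disagreement: -1.0
print(mcc(25, 25, 25, 25))  # coin-flip predictions: 0.0
```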

Frequently Asked Questions

Why is accuracy misleading for imbalanced datasets?

If 99% of cases are negative, a model that always predicts negative achieves 99% accuracy while catching zero positives. Precision, recall, F1, and MCC are more informative for imbalanced data because they focus on the minority class performance.
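A minimal illustration with a hypothetical 99%-negative dataset:

```python
# "Always predict negative" on 990 negatives and 10 positives:
tp, fp, fn, tn = 0, 0, 10, 990

accuracy = (tp + tn) / (tp + fp + fn + tn)  # 0.99, looks great
recall = tp / (tp + fn)                     # 0.0, catches no positives
print(accuracy, recall)
```

Despite 99% accuracy, recall is zero; MCC would also be 0, flagging the model as no better than chance.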

When should I use MCC instead of F1?

MCC considers all four quadrants of the confusion matrix and produces a balanced measure that works well even with class imbalance. F1 focuses only on the positive class. MCC ranges from −1 to +1, where 0 is random and 1 is perfect.

What is the difference between F0.5 and F2 scores?

F0.5 weights precision higher than recall — use it when false positives are costly (e.g., legal search). F2 weights recall higher — use it when false negatives are costly (e.g., disease screening). F1 weights them equally.
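All three are instances of the general F-beta score, F_β = (1 + β²) × P × R / (β² × P + R). A sketch with hypothetical precision and recall values:

```python
def f_beta(precision, recall, beta):
    """General F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical classifier: precision 0.9, recall 0.6
print(f_beta(0.9, 0.6, 0.5))  # ~0.818, pulled toward precision
print(f_beta(0.9, 0.6, 1.0))  # 0.72, the balanced F1
print(f_beta(0.9, 0.6, 2.0))  # ~0.643, pulled toward recall
```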

How do I interpret Cohen's kappa?

Kappa measures agreement beyond chance. Values below 0.20 indicate poor agreement, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and above 0.80 almost perfect agreement.
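For binary classification, kappa can be derived from the same four counts; a sketch of the computation:

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n                        # observed agreement
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)  # chance agreement on positive
    p_no = ((fn + tn) / n) * ((fp + tn) / n)   # chance agreement on negative
    p_e = p_yes + p_no
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(90, 10, 5, 895))  # ~0.915, almost perfect agreement
```

On the worked example's counts, kappa (≈0.915) is noticeably lower than raw accuracy (0.985) because the heavy negative majority makes much of that agreement expected by chance.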

What are likelihood ratios used for?

LR+ (positive likelihood ratio) tells you how much more likely a positive test result is in a truly positive person compared to a truly negative person. LR+ > 10 strongly confirms the condition; LR− < 0.1 strongly rules it out.
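Both ratios come directly from sensitivity and specificity: LR+ = sensitivity / (1 − specificity), LR− = (1 − sensitivity) / specificity. Using the worked example's counts:

```python
def likelihood_ratios(tp, fp, fn, tn):
    """Positive and negative likelihood ratios of a diagnostic test."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

lr_pos, lr_neg = likelihood_ratios(90, 10, 5, 895)
print(lr_pos)  # ~85.7, strongly confirms the condition
print(lr_neg)  # ~0.053, strongly rules it out
```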

Can I use this for multi-class classification?

This calculator handles binary classification. For multi-class problems, you can compute a confusion matrix for each class (one-vs-rest) and then macro- or micro-average the metrics.
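A sketch of the one-vs-rest reduction with hypothetical per-class (TP, FP) counts, shown here for precision:

```python
# Hypothetical per-class counts from a one-vs-rest reduction of
# a 3-class problem: class -> (TP, FP)
per_class = {
    "cat": (40, 5),
    "dog": (30, 8),
    "bird": (20, 2),
}

# Macro average: unweighted mean of per-class precision
macro = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)

# Micro average: pool the counts across classes, then compute once
tp_sum = sum(tp for tp, _ in per_class.values())
fp_sum = sum(fp for _, fp in per_class.values())
micro = tp_sum / (tp_sum + fp_sum)

print(macro, micro)
```

Macro averaging treats every class equally regardless of size; micro averaging weights classes by their frequency, so large classes dominate.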

Related Pages