Classification Metrics: Sensitivity, Specificity, and Beyond
Accuracy, sensitivity (recall), specificity, positive predictive value, negative predictive value, and why accuracy alone is almost always misleading in medical applications.
The Setup
A classification model takes some input (an image, a record, a waveform) and assigns it to one of a handful of categories: pneumonia/healthy, malignant/benign, refer/don’t refer, etc. The metrics in this article are the tools we use to grade how well the model does that job.
If you’ve ever read a diagnostic accuracy study, most of this vocabulary will feel familiar. Sensitivity and specificity were not invented by computer scientists; they were borrowed (with some new accessories) from the diagnostic test literature. The shift to machine learning has not changed what these metrics measure. It has changed how easy it is to game them, and how casually they’re sometimes reported.
This section is therefore a bit of a cautionary tale, since AI projects can abuse these statistics in a way you may not have seen before. So let’s go through them carefully, in the order you’ll probably encounter them in any AI paper, and call out some issues you should be alert to.
Accuracy: Not Very Useful
Accuracy comes up a lot because “how accurate is the model?” is a natural question for people to ask. Accuracy is the fraction of predictions that were correct. If your model classified 950 cases correctly out of 1,000, your accuracy is 95%.
That sounds great until you remember that in a medical application, whatever you’re looking for is usually pretty rare. If you’re judging whether a skin blemish might be malignant, you can expect it to be normal 990 times out of a thousand. Your “model” could be a single line of code that returns “Normal, go about your day”, and it would be 99% accurate while catching zero malignancies. This problem is called class imbalance, and it’s the default state for clinical data. Most patients in any screening population do not have the condition you’re screening for.
You sometimes see accuracy reported in AI papers, but you should not give it much weight unless it’s an unusual situation where the classes are fairly balanced.
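To make the class imbalance problem concrete, here is a minimal sketch of the one-line “model” described above. The dataset is invented for illustration (10 malignant lesions among 1,000), and the function name is hypothetical.

```python
# Hypothetical screening set: 10 malignant lesions (label 1) among 1,000 cases.
# These numbers are illustrative only, matching the 990-in-1,000 example above.
labels = [1] * 10 + [0] * 990


def always_normal(_case):
    """A 'model' that ignores its input and always predicts 0 (normal)."""
    return 0


predictions = [always_normal(case) for case in labels]

# Accuracy: fraction of predictions that match the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# How many malignant cases the trivial model actually caught.
malignant_found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(f"accuracy: {accuracy:.1%}")
print(f"malignant cases found: {malignant_found} of {sum(labels)}")
```

The accuracy comes out at 99% even though not a single malignant case is found, which is why a trivial baseline is worth computing before being impressed by a headline number.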
Sensitivity and Specificity: That’s the stuff.
Clinical researchers all know about sensitivity and specificity, but we’re including them so that the cross-references later make sense and because the ML community sometimes uses slightly different names for them (especially people coming from CompSci rather than medicine).
Sensitivity (which computer scientists often call recall, and math people call true positive rate) is the fraction of actually-positive cases that the model correctly identified. If 100 patients in your dataset have the disease and your model flags 88 of them, sensitivity is 88%.
Specificity (also called true negative rate) is the fraction of actually-negative cases that the model correctly identified. If 900 patients are disease-free and your model correctly lets 855 of them go, specificity is 95%.
These are properties of the test (in our case, the model). Since they do not depend on the prevalence of disease in the population, they are comparable across studies. They are also, taken on their own, easy to manipulate: a model that says “positive” to everything has 100% sensitivity and a model that says “negative” to everything has 100% specificity. But since they trade off against each other, if they are reported together we can get a good idea of model performance.
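As a minimal sketch, the two definitions can be written directly in terms of confusion-matrix counts. The function names are our own, and the counts are the ones from the examples above (100 diseased patients with 88 flagged, 900 disease-free patients with 855 cleared).

```python
def sensitivity(tp, fn):
    """Fraction of actually-positive cases the model flagged (recall, TPR)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actually-negative cases the model cleared (TNR)."""
    return tn / (tn + fp)


# Counts from the examples in the text.
tp, fn = 88, 12     # 100 diseased patients, 88 flagged
tn, fp = 855, 45    # 900 disease-free patients, 855 cleared

print(f"sensitivity: {sensitivity(tp, fn):.0%}")
print(f"specificity: {specificity(tn, fp):.0%}")

# A model that says "positive" to everything: perfect sensitivity, zero specificity.
print(sensitivity(tp=100, fn=0), specificity(tn=0, fp=900))

# A model that says "negative" to everything: the mirror image.
print(sensitivity(tp=0, fn=100), specificity(tn=900, fp=0))
```

The two degenerate models at the end show why neither number means much when reported without the other.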
A note on vocabulary: ML practitioners (especially computer scientists) often use recall where you’d say sensitivity, and precision where you’d say positive predictive value. If you’re reading a paper written by computer scientists who recently discovered medicine, mentally translate.
Positive and Negative Predictive Value
Although sensitivity and specificity are comparable across studies, they are insufficient because they don’t take into account disease prevalence. Sensitivity and specificity describe the test. PPV and NPV describe what a particular result means in a particular population.
Positive Predictive Value (PPV): if the model says positive, what’s the probability the patient is actually positive?
Negative Predictive Value (NPV): if the model says negative, what’s the probability the patient is actually negative?
These depend on disease prevalence. A 95% sensitive, 95% specific model might sound great, but if you deploy it in a population where 1% of patients have the disease, the PPV is around 16%. Out of every 100 positive flags the model raises, only 16 are true positives. The other 84 are people you’ve just told (or implied to their clinician) that they might be sick.
This is exactly the phenomenon Bayes-aware clinicians worry about with any screening test, and it is the single most common reason that an AI tool which looked great on a balanced research dataset disappoints in the real world. The model didn’t get worse. The denominator just changed.
When you read a paper, look for PPV and NPV reported at the disease prevalence the device will be used at. If the test population had 50% disease prevalence and the real world has 2%, the headline PPV is essentially science fiction. Both GMLP Principle 8 and FDA reviewers (who care for the same reason) expect this to be addressed explicitly in submissions.
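If a paper only reports sensitivity and specificity, you can translate them to PPV and NPV at any prevalence yourself. The sketch below applies Bayes’ rule to population fractions. The function name is hypothetical, and it assumes sensitivity and specificity stay the same when the model moves to the new population, which is not guaranteed in practice.

```python
def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) for a test at a given disease prevalence.

    Assumes sensitivity and specificity carry over unchanged to the
    target population (an assumption, not a guarantee).
    """
    # Expected fraction of the whole population in each confusion-matrix cell.
    tp = sens * prevalence
    fn = (1 - sens) * prevalence
    tn = spec * (1 - prevalence)
    fp = (1 - spec) * (1 - prevalence)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# The same 95% sensitive, 95% specific model at three different prevalences.
for prevalence in (0.50, 0.02, 0.01):
    ppv, npv = predictive_values(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

At 50% prevalence the PPV matches the headline 95%; at 1% it falls to roughly 16%, as described above, without the model itself changing at all.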
A Worked Example
Imagine a model with sensitivity 90% and specificity 95%, deployed in a screening setting where disease prevalence is 2%.
Out of 10,000 patients:
- 200 have the disease. The model correctly flags 180 (90% sensitivity), misses 20.
- 9,800 do not. The model correctly clears 9,310 (95% specificity), incorrectly flags 490.
So the model produces 180 + 490 = 670 positive flags. Of those, only 180 are true positives.
PPV = 180 / 670 = 27%.
Roughly three out of four “positive” findings from this excellent-looking model are false positives. That’s the reality of screening at low prevalence, and it has nothing to do with whether the model is “good.” This is also why reader studies and prospective validation matter so much. The metrics that look stable on paper behave very differently when they meet a real clinical population.
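The same arithmetic can be checked as whole-patient counts. This is only a restatement of the worked example in code; the variable names are ours, and the NPV line is an extra step the example above does not spell out.

```python
n_patients = 10_000
prevalence, sens, spec = 0.02, 0.90, 0.95

diseased = round(n_patients * prevalence)   # patients who have the disease
healthy = n_patients - diseased             # patients who do not

tp = round(diseased * sens)   # diseased patients correctly flagged
fn = diseased - tp            # diseased patients missed
tn = round(healthy * spec)    # healthy patients correctly cleared
fp = healthy - tn             # healthy patients incorrectly flagged

positive_flags = tp + fp
ppv = tp / positive_flags
npv = tn / (tn + fn)

print(f"TP={tp}, FN={fn}, TN={tn}, FP={fp}")
print(f"positive flags: {positive_flags}")
print(f"PPV: {ppv:.0%}")
print(f"NPV: {npv:.1%}")
```

Note the asymmetry: at this prevalence a negative result is highly reassuring, while a positive result is wrong far more often than it is right.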
F1 Score: A Convenient Summary, with caveats as usual
F1 is the first one that might be new to a clinical researcher. It’s very common in ML papers. The F1 score is the harmonic mean of precision (PPV) and recall (sensitivity):
F1 = 2 × (precision × recall) / (precision + recall)
It collapses the two numbers into one, which is convenient when you’re comparing dozens of models in a paper or running automated hyperparameter searches. It’s also occasionally useful for class-imbalanced problems where accuracy is misleading.
There is a catch. F1 weights precision and recall equally, which is almost never what you want in a clinical setting. In a screening context, missing a case is much worse than a false alarm; in a confirmatory test, the opposite is true. F1 ignores that, so while it’s a useful summary for engineers comparing model versions, it’s a starting point for clinical evaluation rather than a conclusion.
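A short sketch of the equal-weighting problem: two hypothetical models with mirror-image precision and recall get exactly the same F1. The function name and the example numbers are invented for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision (PPV) and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0  # convention when the model has no true positives at all
    return 2 * precision * recall / (precision + recall)


# Two hypothetical models with mirror-image behavior.
catches_most_cases = f1_score(precision=0.60, recall=0.90)   # few misses, more false alarms
rarely_cries_wolf = f1_score(precision=0.90, recall=0.60)    # few false alarms, more misses

print(f"{catches_most_cases:.2f} vs {rarely_cries_wolf:.2f}")
```

Both models score 0.72, yet one misses 10% of cases and the other misses 40%. Which of those is acceptable is a clinical question that F1 cannot answer.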
Likelihood Ratios
Likelihood ratios are less common. They should probably be more popular than they are, especially in the medical AI literature.
Positive Likelihood Ratio (LR+): how much more likely a positive result is in a diseased patient than in a non-diseased one. LR+ = sensitivity / (1 - specificity).
Negative Likelihood Ratio (LR-): how likely a negative result is in a diseased patient relative to a non-diseased one. LR- = (1 - sensitivity) / specificity. For a useful test this is well below 1, so a negative result lowers the odds of disease.
The reason they’re useful is that they let you translate any patient’s pre-test probability into a post-test probability, using the same model results. A model with LR+ of 20 multiplies the pre-test odds of disease by 20 when it fires. A radiologist looking at a chest X-ray with a known clinical context can use that to actually update what they believe, instead of just reading “positive” and shrugging.
Many AI papers don’t report these. If you’re a reviewer or a reader who wants to do Bayesian thinking about clinical utility, you can compute them yourself from any reported sensitivity and specificity.
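Here is a minimal sketch of that do-it-yourself calculation, using the standard odds form of Bayes’ rule. The function names are hypothetical, and the numbers reuse the model from the worked example (sensitivity 90%, specificity 95%) with a 2% pre-test probability.

```python
def likelihood_ratios(sens, spec):
    """Return (LR+, LR-) from sensitivity and specificity."""
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a pre-test probability with a likelihood ratio via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


lr_pos, lr_neg = likelihood_ratios(sens=0.90, spec=0.95)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.3f}")

# Same model, same 2% pre-test probability, two possible results.
after_positive = post_test_probability(0.02, lr_pos)
after_negative = post_test_probability(0.02, lr_neg)
print(f"after a positive result: {after_positive:.1%}")
print(f"after a negative result: {after_negative:.2%}")
```

With a 2% pre-test probability, the post-test probability after a positive result is the same roughly 27% as the PPV in the worked example, which is a useful sanity check. The advantage of the likelihood ratio is that you can plug in a different pre-test probability for each patient.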
Where Your Threshold Comes From
Most classifiers don’t actually output “positive” or “negative.” They output a continuous score (often a probability between 0 and 1), and as a separate step outside the model, someone picks a threshold that divides positive from negative.
Sensitivity, specificity, PPV, NPV, and F1 all depend on where that threshold is placed. Slide the threshold lower and sensitivity goes up while specificity goes down; slide it higher and the opposite happens. There are several ways to pick a threshold, not all of them defensible:
- Maximize Youden’s J statistic (sensitivity + specificity - 1). This is a common default and lands you near the “elbow” of the ROC curve.
- Fix sensitivity at a clinically required minimum (say, 95% for cancer screening) and report whatever specificity you get there. This is honest and clinically grounded.
- Minimize expected cost if you have credible numbers for the cost of a missed case and the cost of a false positive. Usually you don’t, so this is more theoretical than practical.
- Whatever maximizes the test set score. This is a bad reason and you should feel bad if you consider it. It is also fairly common.
If a paper doesn’t tell you how the threshold was chosen, it should raise at least one of your eyebrows. If you figure out that the threshold was tuned on the test set, both of them should be waggling. See 4.6 Cross-Validation Strategies for why.
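The first two strategies in the list can be sketched in a few lines. This is a simplified illustration, not a production implementation: the function names are ours, the scores and labels are made up, and they stand in for a validation set that is kept separate from the test set.

```python
def sens_spec_at(threshold, scores, labels):
    """Sensitivity and specificity when score >= threshold counts as positive."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(scores, labels):
    """Candidate threshold that maximizes J = sensitivity + specificity - 1."""
    def youden_j(threshold):
        sens, spec = sens_spec_at(threshold, scores, labels)
        return sens + spec - 1
    return max(sorted(set(scores)), key=youden_j)


def threshold_for_sensitivity(scores, labels, min_sensitivity):
    """Highest candidate threshold whose sensitivity meets the required minimum."""
    candidates = sorted(set(scores), reverse=True)
    for threshold in candidates:
        sens, _ = sens_spec_at(threshold, scores, labels)
        if sens >= min_sensitivity:
            return threshold
    return candidates[-1]  # the lowest threshold flags every case


# Made-up VALIDATION scores and labels (1 = disease). Illustrative only.
val_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]
val_labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

t_youden = youden_threshold(val_scores, val_labels)
t_fixed = threshold_for_sensitivity(val_scores, val_labels, min_sensitivity=0.75)
print("Youden threshold:", t_youden, sens_spec_at(t_youden, val_scores, val_labels))
print("Fixed-sensitivity threshold:", t_fixed, sens_spec_at(t_fixed, val_scores, val_labels))
```

Whichever rule is used, the key design choice is where it runs: the threshold is chosen on validation data, frozen, and only then applied to the test set to produce the reported numbers.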
Multi-Class Classification: It’s Just More of the Same
So far we’ve talked about binary classification (disease vs. no disease). Many clinical problems are multi-class: classify a lesion as benign, indeterminate, or malignant, or classify an arrhythmia as one of seven types. For these multi-class problems, the same metrics apply but you have to choose how to aggregate them across classes:
- Macro-averaged: compute the metric separately for each class, then average. Treats all classes equally regardless of how common they are.
- Micro-averaged: pool all the predictions together first, then compute. Dominated by the most common class.
- Per-class: report each class separately. Most informative, also most space-consuming.
For clinical purposes, per-class is almost always what you want to see. A model that’s 95% accurate on the common class and 30% accurate on the rare-but-critical class can hide that disparity behind a single averaged number, especially a micro-average, which is dominated by the common class.
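To see how the three aggregation choices compare, here is a small sketch using recall (sensitivity) on an invented three-class lesion dataset. The class counts and predictions are made up, and the function name is hypothetical.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall (sensitivity) computed separately for each true class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Invented data: 900 benign, 70 indeterminate, 30 malignant lesions.
y_true = ["benign"] * 900 + ["indeterminate"] * 70 + ["malignant"] * 30
y_pred = (
    ["benign"] * 855 + ["indeterminate"] * 45        # benign: 855 of 900 correct
    + ["indeterminate"] * 49 + ["benign"] * 21       # indeterminate: 49 of 70 correct
    + ["malignant"] * 9 + ["indeterminate"] * 21     # malignant: 9 of 30 correct
)

recalls = per_class_recall(y_true, y_pred)

# Macro: average the per-class values, every class counts equally.
macro_recall = sum(recalls.values()) / len(recalls)

# Micro: pool all predictions first. For single-label problems this equals accuracy.
micro_recall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for cls, value in recalls.items():
    print(f"{cls}: {value:.0%}")
print(f"macro-averaged recall: {macro_recall:.0%}")
print(f"micro-averaged recall: {micro_recall:.1%}")
```

The pooled figure sits above 90% and the macro average at 65%, but only the per-class breakdown shows that the model finds just 30% of the malignant lesions.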
The Metrics Section in a Real AI Paper: What to Look For
When you read a published clinical AI paper, it can be helpful to walk through a mental checklist:
- Is accuracy the headline number? If yes, what’s the class balance? Could a trivial baseline have done about as well?
- Are sensitivity and specificity both reported, with confidence intervals? A point estimate without a CI is almost meaningless at any realistic sample size. (More on this in 4.5 Statistical Significance, Confidence Intervals, and Sample Size.)
- Is PPV reported at the relevant clinical prevalence? If the study only reports results on a 50/50 case-control design, you need to translate the numbers to the prevalence in the actual deployment population.
- How was the operating threshold chosen, and on which data? If the threshold was selected on the test set, you’re looking at optimistic numbers.
- Are subgroup metrics reported? Performance by age, sex, race, scanner manufacturer, site. If the paper aggregates everything into one number, you have no idea whether the model works for everyone. Section 7.2 goes deeper on this.
- What’s the reference standard? A model can only be as good as the truth it was trained against. See GMLP Principle 5 and 5.7 Reference Standards.
It will be rare for a paper to check all these boxes, but if it fails most of them you can’t really know whether the model works.
A Brief Word on Confidence Intervals
This will get its own treatment in 4.5, but it is worth flagging here: every metric in this article is an estimate from a finite sample. The “true” sensitivity of your model on the population it will be deployed against is unknown, and what you computed on your test set is just one draw.
Report (and expect) confidence intervals. A reported sensitivity of “92%” might have a 95% CI of 88-95% (probably fine) or 73-98% (basically uninformative). The point estimate alone tells you almost nothing without that range.
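As a rough illustration of how much sample size matters, the sketch below computes a 95% Wilson score interval for the same 92% sensitivity at two sample sizes. The Wilson interval is our choice here, one common method for proportions among several; this article does not prescribe a method, and the counts are invented.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same 92% sensitivity, estimated from 25 vs. 500 diseased patients (invented counts).
small_low, small_high = wilson_interval(successes=23, n=25)
large_low, large_high = wilson_interval(successes=460, n=500)

print(f"n=25:  92% (95% CI {small_low:.0%} to {small_high:.0%})")
print(f"n=500: 92% (95% CI {large_low:.0%} to {large_high:.0%})")
```

The point estimate is identical in both cases, but the interval from 25 patients is several times wider than the one from 500, which is the difference between “probably fine” and “basically uninformative.”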
Key Takeaways
- Accuracy alone is misleading whenever the classes are imbalanced, which is almost always in clinical settings.
- Sensitivity and specificity describe the test. They don’t depend on disease prevalence and are the right primary metrics for technical performance.
- PPV and NPV describe what a result means in a particular population, and they depend heavily on prevalence. The PPV that matters is the one at deployment prevalence, not test set prevalence.
- Likelihood ratios are the most clinically actionable summary and the most underused. Consider computing them yourself if the paper doesn’t.
- Every binary metric depends on the threshold.
- Per-class and per-subgroup metrics are important information for any model that will see a real clinical population.
- A point estimate without a confidence interval is half a number.
What to Read Next
- 4.2 The ROC Curve and AUC: how to evaluate a model across all possible thresholds at once.
- 4.4 Calibration: a clinically important metric that is usually overlooked.
- 4.5 Statistical Significance, Confidence Intervals, and Sample Size: how to put error bars around everything.
- 5.7 Reference Standards and Ground Truth in Medical AI: why your metrics inherit the limitations of your “truth.”
- 6.7 Good Machine Learning Practice, Principle 8: the FDA’s view on what “clinically meaningful performance testing” looks like, and why it overlaps strongly with this article.
- 7.2 Demographic Fairness: Performance Across Populations: the subgroup-metrics conversation.
This article is part of the AI in Clinical Research Knowledge Base.