Calibration: Does a 90% Confidence Score Really Mean 90%?

Why model confidence scores are often poorly calibrated, why this matters for clinical decision-making, and how to measure and improve calibration.

The Problem in One Sentence

Most modern AI classifiers output a number between 0 and 1 that looks like a probability. A clinician sees “73% probability of malignancy” and acts on it. The unfortunate reality is that the model’s “73%” isn’t exactly a probability, at least not the kind of probability the clinician is probably looking for.

The clinician wants to know the probability that a lump is malignant. The model doesn’t give that probability, because the training process never explicitly required the output numbers to mean anything in particular. The model learned to rank cases (which is what AUC measures), and the scores got squeezed through a final layer that makes them look probability-shaped (usually a sigmoid or softmax function). Whether those scores are actually calibrated to clinical probabilities is a separate question, and a separate piece of work.

Calibration is the property of a model where, when it says 70%, it’s actually right 70% of the time. A perfectly calibrated model has the property that if you collect every case where it predicted 0.7, exactly 70% of them turn out to be positive. If you collect every case where it predicted 0.2, exactly 20% are positive. And so on, across the full range of scores.
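Stated formally, with \hat{p} denoting the model's output score and Y the true outcome (0 or 1), perfect calibration is the condition below. The symbols are introduced here for illustration and are not taken from any particular paper.

```latex
\Pr\left(Y = 1 \mid \hat{p} = p\right) = p \quad \text{for all } p \in [0, 1]
```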

Why Discrimination and Calibration Are Different Things

Two models can have identical AUC and very different calibration. Imagine Model X assigns scores around 0.55 to all healthy patients and around 0.65 to all diseased patients. The ranking is perfect, AUC = 1.0. But anyone seeing a “0.65” thinks “65% probability of disease,” when actually in this dataset every patient scoring 0.65 has the disease. We would say the model discriminates perfectly but is poorly calibrated.

Now imagine Model Y assigns scores spread across the full 0-1 range, with very good calibration: a score of 0.65 really does correspond to ~65% true disease prevalence. But the ranking is sloppy: some healthy patients score 0.7, some diseased patients score 0.4. AUC might only be 0.78, even though calibration is excellent.

Both models are useful, in different ways. Model X is great for triage (rank them, treat the top of the list). Model Y is great for decision support (if it says 65%, the clinician can actually use that number to update their Bayesian intuition).

In clinical practice, you usually want both. Discrimination (AUC) and calibration are independent properties of a model.
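The Model X example above can be checked numerically. The sketch below uses made-up toy data (80 healthy patients, 20 diseased) and a small pairwise AUC helper written for this illustration. It is not a substitute for a library implementation.

```python
from itertools import product

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy data mirroring Model X: every healthy patient scores 0.55, every diseased one 0.65.
labels = [0] * 80 + [1] * 20
scores = [0.55] * 80 + [0.65] * 20

def observed_rate(score):
    """Fraction of patients with this exact score who actually have the disease."""
    outcomes = [y for s, y in zip(scores, labels) if s == score]
    return sum(outcomes) / len(outcomes)

model_x_auc = auc(scores, labels)
print(f"AUC = {model_x_auc:.2f}")                                # ranking is perfect
print(f"score 0.65 -> observed rate {observed_rate(0.65):.2f}")  # not 0.65
print(f"score 0.55 -> observed rate {observed_rate(0.55):.2f}")  # not 0.55
```

The ranking metric is at its maximum while neither score matches the event rate behind it, which is the gap between discrimination and calibration.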

How to Measure Calibration

There are a few standard metrics.

Reliability Diagrams (Calibration Plots)

Reliability diagrams are usually the first measurement of calibration one makes. You bin your test predictions by score (say, 0-10%, 10-20%, …, 90-100%). For each bin, you compute two numbers: the average predicted probability for cases in that bin and the actual fraction of positives among them. Plot one against the other, conventionally with the predicted probability on the x-axis and the observed fraction on the y-axis.

A perfectly calibrated model produces points lying along the diagonal while a model that systematically over-predicts disease will sit below the diagonal. A model that under-predicts will sit above it.
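The binning step can be sketched in a few lines. The function name, the equal-width bins, and the synthetic over-predicting model below are all assumptions made for illustration. Each row holds the x and y coordinates of one point on the reliability diagram.

```python
import random

def reliability_bins(probs, labels, n_bins=10):
    """Return (mean predicted, observed fraction, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, frac_pos, len(members)))
    return rows

# Synthetic, deliberately over-predicting model: the true event rate is only
# 70% of whatever score the model reports.
rng = random.Random(0)
probs = [rng.random() for _ in range(20000)]
labels = [1 if rng.random() < 0.7 * p else 0 for p in probs]

for mean_pred, frac_pos, count in reliability_bins(probs, labels):
    side = "below" if frac_pos < mean_pred else "on/above"
    print(f"predicted {mean_pred:.2f}  observed {frac_pos:.2f}  n={count:5d}  {side} diagonal")
```

In practice, scikit-learn's `calibration_curve` performs the same binning and returns the two coordinate arrays directly.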

Reliability diagrams are pictures, and the shape of a miscalibration curve tells you a lot. Modern deep learning models, for instance, tend to be overconfident (the high-probability bins sit below the diagonal: the model says 95% but the truth is 80%). Logistic regression models tend to be reasonably well-calibrated. Random forests tend to under-spread (their probabilities cluster around 0.5 rather than spreading across the full range).

Expected Calibration Error (ECE)

For comparison between models, it can be helpful to quantify the deviation from the diagonal seen in a reliability diagram. The ECE is a single-number summary of how far the reliability curve is from the diagonal. For each bin, compute the absolute difference between the average predicted probability and the observed fraction of positives, weight it by the share of all samples that fall in that bin, and add the weighted differences up.

ECE has well-known limitations: it depends on how you choose your bins, and over- and under-predictions that land in the same bin can cancel each other out. It’s a reasonable first-pass summary but it’s a poor substitute for actually looking at the calibration curve.
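A minimal sketch of the ECE computation, assuming equal-width bins (other binning schemes exist and give different numbers). The function name and the ten-case example are invented so the result can be checked by hand.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Sum over bins of (bin share of samples) * |mean predicted - observed rate|."""
    n = len(probs)
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]  # [sum of p, sum of y, count]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        sums[idx][0] += p
        sums[idx][1] += y
        sums[idx][2] += 1
    ece = 0.0
    for sum_p, sum_y, count in sums:
        if count:
            ece += (count / n) * abs(sum_p / count - sum_y / count)
    return ece

# Hand-checkable example: 4 cases at 0.9 (all positive), 6 cases at 0.2 (none positive).
probs = [0.9] * 4 + [0.2] * 6
labels = [1] * 4 + [0] * 6

# Ten bins: gaps are |0.9 - 1.0| = 0.1 and |0.2 - 0.0| = 0.2,
# so ECE = 0.4 * 0.1 + 0.6 * 0.2 = 0.16.
print(round(expected_calibration_error(probs, labels, n_bins=10), 3))

# One bin: the under- and over-prediction partly cancel,
# so ECE = |0.48 - 0.40| = 0.08 for exactly the same predictions.
print(round(expected_calibration_error(probs, labels, n_bins=1), 3))
```

The same predictions give 0.16 or 0.08 depending only on the binning, which is why the number should always be reported together with the bin definition.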

Brier Score

The Brier score is the mean squared error between predicted probabilities and actual outcomes (0 or 1). It captures both discrimination and calibration in a single number, which is sometimes helpful and sometimes confusing. Lower is better. Brier scores below 0.1 are sometimes quoted as well-performing for clinical risk prediction, and below 0.05 as excellent, but these cutoffs only make sense relative to prevalence: a model that always predicts the prevalence p scores p(1 − p), which is 0.25 at 50% prevalence and already about 0.02 at 2% prevalence. As with all single-number summaries, treat it as a way to compare models on the same dataset and choose the ones to dive into more.
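A short sketch of the computation, together with the score earned by an uninformative model that always predicts the prevalence. The prevalence values are arbitrary illustrations.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# A model that always outputs the prevalence p scores p * (1 - p), so the same
# "know-nothing" strategy looks very different at different prevalences.
baseline = {}
for prevalence in (0.5, 0.1, 0.02):
    n = 1000
    n_pos = round(prevalence * n)
    labels = [1] * n_pos + [0] * (n - n_pos)
    probs = [prevalence] * n
    baseline[prevalence] = brier_score(probs, labels)
    print(f"prevalence {prevalence:.2f}: baseline Brier = {baseline[prevalence]:.4f}")
```

Comparing a model's Brier score against this baseline for the same dataset is more informative than comparing it against a fixed cutoff. scikit-learn exposes the same metric as `brier_score_loss`.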

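The Brier score can be decomposed mathematically into separate terms for calibration (often called reliability), resolution (how far the observed event rates in different score groups spread away from the overall event rate), and uncertainty (the irreducible noise from the data itself, set by the prevalence).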
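In the common three-term form (Murphy's decomposition), the terms are usually called reliability (the calibration term), resolution, and uncertainty (the irreducible part); the "refinement" of the two-term form equals uncertainty minus resolution. The notation below is introduced here for illustration: predictions are grouped into K bins, with n_k cases in bin k, prediction f_k, observed event rate \bar{o}_k, and overall event rate \bar{o} across all N cases. The identity is exact when every prediction within a bin is the same value, and approximate otherwise.

```latex
\mathrm{BS}
  = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \left(f_k - \bar{o}_k\right)^2}_{\text{reliability (calibration)}}
  - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \left(\bar{o}_k - \bar{o}\right)^2}_{\text{resolution}}
  + \underbrace{\bar{o}\left(1 - \bar{o}\right)}_{\text{uncertainty}}
```

A lower reliability term means better calibration, a higher resolution term means the score groups separate outcomes better, and the uncertainty term is fixed by the prevalence.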

Hosmer-Lemeshow Test

A familiar goodness-of-fit test from clinical prediction modeling, the Hosmer-Lemeshow test checks whether observed and predicted event rates differ significantly across deciles of risk. Like all goodness-of-fit tests, it has shortcomings: it can fail to detect important miscalibration in small samples, and it can flag clinically trivial miscalibration in very large ones. It doesn’t seem to be used much in modern research, but you might see it in older clinical AI papers.
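For readers who meet the test in an older paper, the statistic itself is simple to sketch. The function name and synthetic data below are assumptions for illustration. The code computes only the statistic, which is conventionally compared against a chi-square distribution with (number of groups − 2) degrees of freedom when evaluated on the data the model was developed on.

```python
import random

def hosmer_lemeshow_statistic(probs, labels, n_groups=10):
    """Sum over risk groups of (observed - expected)^2 / expected, for events and non-events."""
    ranked = sorted(zip(probs, labels), key=lambda t: t[0])
    n = len(ranked)
    stat = 0.0
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        expected_pos = sum(p for p, _ in group)
        observed_pos = sum(y for _, y in group)
        expected_neg = len(group) - expected_pos
        observed_neg = len(group) - observed_pos
        if expected_pos > 0:
            stat += (observed_pos - expected_pos) ** 2 / expected_pos
        if expected_neg > 0:
            stat += (observed_neg - expected_neg) ** 2 / expected_neg
    return stat

# Synthetic comparison: the same scores paired with calibrated outcomes and
# with outcomes whose true rate is only 70% of the score.
rng = random.Random(0)
probs = [rng.uniform(0.05, 0.95) for _ in range(5000)]
calibrated_labels = [1 if rng.random() < p else 0 for p in probs]
overpredicted_labels = [1 if rng.random() < 0.7 * p else 0 for p in probs]

hl_calibrated = hosmer_lemeshow_statistic(probs, calibrated_labels)
hl_overpredicted = hosmer_lemeshow_statistic(probs, overpredicted_labels)
print(f"calibrated: {hl_calibrated:.1f}   over-predicting: {hl_overpredicted:.1f}")
```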

Why Calibration Matters Clinically

The need for calibration will be obvious to most clinicians, but it may not be intuitive for pure compsci researchers who are taking on a medical learning task. Here are some examples where a poorly calibrated model might be misused in clinical settings.

Shared decision-making. “Mrs. Patel, the model estimates your 5-year cardiovascular event risk at 18%.” If that 18% is actually closer to 9% (the model is overconfident at the high end), Mrs. Patel might be put on a statin she doesn’t need. Or, if the model under-predicts, she might forgo treatment she does need.

Risk-stratified treatment. Modern oncology often uses model-predicted recurrence risk to decide treatment intensity. If the predicted “30% risk” really means 18%, you’re over-treating one patient population. If it really means 45%, you’re under-treating another.

Combining with other tests. If a clinician wants to convert the model’s output into a likelihood ratio for Bayesian updating against their clinical impression, the math only works if the probabilities are calibrated; otherwise the updates are wrong.

Threshold-based decision rules. “Refer if model risk > 5%” is a common pattern. If the model is poorly calibrated at low probabilities, that 5% threshold is meaningless.

This is also why GMLP Principle 8 emphasizes testing that demonstrates device performance under clinically relevant conditions. Reporting AUC without calibration assessment is incomplete, and the FDA has flagged this risk.
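The "combining with other tests" case above can be made concrete with a little arithmetic. The sketch below assumes one common construction: the likelihood ratio implied by a model probability is its odds divided by the odds of the prevalence the model was calibrated at, and the model's evidence is treated as independent of the clinician's own impression. All numbers and function names are hypothetical.

```python
def odds(prob):
    return prob / (1.0 - prob)

def implied_likelihood_ratio(model_prob, reference_prevalence):
    """Likelihood ratio implied by a model probability, relative to the
    prevalence of the population the model was calibrated on."""
    return odds(model_prob) / odds(reference_prevalence)

def bayes_update(prior_prob, likelihood_ratio):
    """Posterior probability from: posterior odds = prior odds * likelihood ratio."""
    posterior_odds = odds(prior_prob) * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical setting: the clinician's own impression is 20%, the model was
# calibrated on a population with 10% prevalence, and it reports 30%.
clinician_prior = 0.20
reference_prevalence = 0.10

# Taking the reported 30% at face value.
trusted = bayes_update(clinician_prior, implied_likelihood_ratio(0.30, reference_prevalence))
# If a reported "30%" really corresponds to an 18% event rate, the update should be smaller.
actual = bayes_update(clinician_prior, implied_likelihood_ratio(0.18, reference_prevalence))

print(f"posterior if the 30% is taken at face value: {trusted:.2f}")  # about 0.49
print(f"posterior using the true 18% event rate:     {actual:.2f}")  # about 0.33
```

Under these assumptions the miscalibrated score moves the clinician to roughly 49% when the evidence only supports roughly 33%.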

Why Models Get Miscalibrated

A few common reasons, in rough order of frequency:

Modern neural networks are intrinsically overconfident. This is a real, well-documented phenomenon. Deep networks trained with cross-entropy loss tend to push their outputs toward the extremes (0 and 1), even when the actual evidence doesn’t justify that confidence. Larger networks are typically worse than smaller ones, which is counterintuitive but consistently observed.

Class imbalance. If you trained on 50% positives but you’ll deploy in a population with 2% prevalence, the model’s output scores will be biased high. A “60% predicted probability” in training space might correspond to a much lower true probability in the deployed population.

Distribution shift. If the test population differs from the training population in any meaningful way (different scanner, different demographics, different prevalence), calibration can collapse even if discrimination holds up reasonably well. This is closely related to the data drift discussion in 6.12 Post-Market Surveillance.

Overfitting. A model that has memorized the training set will be very confident about its training-set predictions. Those high confidences won’t generalize, and you’ll see poor calibration on the test set.

Label noise. If the training labels are noisy (e.g., not all “positive” cases in your dataset really have the disease), the model learns to predict probabilities that reflect both the underlying biology and the noise. The probabilities can still be useful, but they’re shifted.
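The class-imbalance case above can be worked through numerically. The sketch below applies the standard odds-based prior correction, which assumes that only the prevalence differs between training and deployment and that the disease looks the same in both populations. That assumption often fails under broader distribution shift. The function name is invented for this example.

```python
def adjust_for_prevalence(prob, train_prevalence, deploy_prevalence):
    """Rescale a probability from the training prevalence to the deployment prevalence.

    Works on the odds scale: adjusted odds = odds * (deployment odds / training odds).
    """
    odds = prob / (1.0 - prob)
    train_odds = train_prevalence / (1.0 - train_prevalence)
    deploy_odds = deploy_prevalence / (1.0 - deploy_prevalence)
    adjusted_odds = odds * deploy_odds / train_odds
    return adjusted_odds / (1.0 + adjusted_odds)

# The example from the text: trained on 50% positives, deployed at 2% prevalence.
adjusted = adjust_for_prevalence(0.60, train_prevalence=0.50, deploy_prevalence=0.02)
print(f"a training-space 60% corresponds to about {adjusted:.1%} in deployment")
```

With these numbers, the "60%" shrinks to roughly 3%, which shows how large the bias can be before any other source of miscalibration is considered.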

Calibration Methods: Fixing It After the Fact

Most discrimination work happens during training. Most calibration work happens after training, on a held-out calibration set. (Note: this is in addition to the test set. Mixing them up is a form of data leakage.)

The two most common methods to calibrate a model are:

Platt scaling. Fit a logistic regression on the model’s raw outputs (or pre-softmax logits) against the true labels in the calibration set. This is simple and works well as long as the miscalibration follows a more-or-less smooth function.

Isotonic regression. Fit a monotonic step function from model output to calibrated probability. This is more flexible than Platt scaling and can capture arbitrary monotonic miscalibration; however, it needs more data to fit well and can overfit on small calibration sets.
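Both methods are small enough to sketch from scratch. The code below is a teaching illustration under stated assumptions, not a reference implementation: Platt scaling is fitted with plain Newton steps (without the label smoothing used in Platt's original method), isotonic regression uses the pool-adjacent-violators algorithm, and the overconfident synthetic model is invented. In practice, scikit-learn's `CalibratedClassifierCV` (with `method="sigmoid"` or `method="isotonic"`) covers both.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(logits, labels, n_iter=25):
    """Fit p = sigmoid(a * logit + b) by Newton's method on the log loss."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, y in zip(logits, labels):
            q = sigmoid(a * s + b)
            w = q * (1.0 - q)
            g_a += (q - y) * s
            g_b += q - y
            h_aa += w * s * s
            h_ab += w * s
            h_bb += w
        h_aa += 1e-9  # tiny ridge keeps the 2x2 solve well-defined
        h_bb += 1e-9
        det = h_aa * h_bb - h_ab * h_ab
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators. Returns (upper score of step, calibrated probability) pairs."""
    # Within tied scores, positives come first so that ties are pooled together.
    pairs = sorted(zip(scores, labels), key=lambda t: (t[0], -t[1]))
    blocks = []  # each block: [sum of labels, count, highest score in block]
    for s, y in pairs:
        blocks.append([float(y), 1, s])
        # Merge backwards while an earlier block has a higher mean than the latest one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]

def predict_isotonic(steps, score):
    for top, prob in steps:
        if score <= top:
            return prob
    return steps[-1][1]

# Synthetic overconfident model: it reports logit z, but the true log-odds are 0.5 * z.
rng = random.Random(0)
logits = [rng.uniform(-6, 6) for _ in range(4000)]
labels = [1 if rng.random() < sigmoid(0.5 * z) else 0 for z in logits]

a, b = fit_platt(logits, labels)
steps = fit_isotonic([sigmoid(z) for z in logits], labels)
print(f"Platt slope a = {a:.2f}, intercept b = {b:.2f}")
print(f"raw score 0.88 -> isotonic {predict_isotonic(steps, 0.88):.2f}")
```

A fitted slope well below 1 is the signature of an overconfident model: the calibrated log-odds are a shrunken version of what the model reported.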

For multi-class problems, there are extensions (temperature scaling is a popular and well-behaved choice for neural networks; it fits a single “temperature” parameter that softens the model’s overconfident outputs).
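Temperature scaling can be sketched in the same spirit. The example assumes a three-class problem with synthetic logits that are three times too large, and uses a golden-section search over log(T) as one simple way to minimize the negative log-likelihood on the calibration set. Gradient-based optimizers are more common in practice. All names are illustrative.

```python
import math
import random

def log_softmax_at(logits, label, temperature):
    """Log-probability of `label` under softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return scaled[label] - log_total

def mean_nll(logit_rows, labels, temperature):
    total = sum(log_softmax_at(row, y, temperature) for row, y in zip(logit_rows, labels))
    return -total / len(labels)

def fit_temperature(logit_rows, labels, low=0.05, high=20.0, n_iter=40):
    """Golden-section search over log(T) for the temperature with the lowest NLL."""
    a, b = math.log(low), math.log(high)
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(n_iter):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if mean_nll(logit_rows, labels, math.exp(c)) < mean_nll(logit_rows, labels, math.exp(d)):
            b = d
        else:
            a = c
    return math.exp((a + b) / 2.0)

# Synthetic calibration set: labels are drawn from softmax(true_logits),
# but the model reports logits that are 3x too large (overconfident).
rng = random.Random(0)
true_logits = [[rng.uniform(-2, 2) for _ in range(3)] for _ in range(3000)]
labels = []
for row in true_logits:
    m = max(row)
    weights = [math.exp(z - m) for z in row]
    labels.append(rng.choices(range(3), weights=weights)[0])
reported_logits = [[3.0 * z for z in row] for row in true_logits]

temperature = fit_temperature(reported_logits, labels)
print(f"fitted temperature: {temperature:.2f}")
```

Because dividing every logit by the same positive number never changes which class scores highest, the predicted class is unaffected. Only the confidence attached to it is softened.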

A few practical notes:

Calibrate on a separate set. Don’t tune calibration on your test set. That set must remain untouched.
Recalibrate after deployment. The deployed population’s prevalence and characteristics drift over time. Calibration should be monitored, not assumed. (See 6.12 on real-world monitoring.)
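Discrimination doesn’t improve after calibration. Platt and temperature scaling are monotonic, so they leave the ranking (and AUC) unchanged; isotonic regression can merge nearby scores into ties, so AUC sometimes drops slightly. That’s expected: calibration fixes what the numbers mean, not how cases are ranked.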
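The first note above amounts to a three-way split. A minimal sketch follows, with hypothetical patient identifiers and arbitrary split fractions. Splitting by patient rather than by image or visit is an added assumption here, made so that one patient's data cannot land in two sets.

```python
import random

def three_way_split(patient_ids, calib_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle unique patient IDs and cut them into train / calibration / test."""
    shuffled = sorted(set(patient_ids))
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_calib = int(len(shuffled) * calib_frac)
    test = shuffled[:n_test]                     # touched once, for the final report
    calib = shuffled[n_test:n_test + n_calib]    # used only to fit the calibrator
    train = shuffled[n_test + n_calib:]          # used only to fit the model
    return train, calib, test

patient_ids = [f"patient-{i:04d}" for i in range(1000)]  # hypothetical identifiers
train, calib, test = three_way_split(patient_ids)
print(len(train), len(calib), len(test))
```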

Calibration and Regulatory Submissions

For any AI/ML device that outputs a probability or confidence score and where that score will influence clinical decision-making, the FDA increasingly expects calibration to be assessed and reported. This isn’t yet a formal hard requirement in every clearance pathway, but it shows up in:

GMLP Principle 8 (clinically meaningful performance testing).
GMLP Principle 9 (clear essential information for users): if your label says “70% probability,” the user is entitled to assume that means something close to 70%.
Post-market surveillance requirements: calibration drift is one of the most common forms of silent performance degradation.

If your device displays a probability or confidence score to a clinician, you should report calibration plots, ECE or Brier scores, and a calibration assessment in subgroups. If your device is a binary classifier that doesn’t expose probabilities to users, calibration matters less for the user-facing claim, but still matters for the threshold-selection story.
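Subgroup assessment can reuse the same calibration metric, computed once per subgroup. The sketch below uses ECE with equal-width bins and an invented two-group example in which overall calibration looks perfect while both subgroups are off. The group labels and function names are hypothetical.

```python
from collections import defaultdict

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width bins."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]  # [sum of p, sum of y, count]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        sums[idx][0] += p
        sums[idx][1] += y
        sums[idx][2] += 1
    n = len(probs)
    return sum((c / n) * abs(sp / c - sy / c) for sp, sy, c in sums if c)

def ece_by_subgroup(probs, labels, groups, n_bins=10):
    """Return {group: (ECE within that group, number of cases)}."""
    buckets = defaultdict(lambda: ([], []))
    for p, y, g in zip(probs, labels, groups):
        buckets[g][0].append(p)
        buckets[g][1].append(y)
    return {g: (ece(ps, ys, n_bins), len(ps)) for g, (ps, ys) in buckets.items()}

# Invented example: every case gets a score of 0.3. Group A has 1 event in 10
# (true rate 10%), group B has 5 events in 10 (true rate 50%).
probs = [0.3] * 20
labels = [1] + [0] * 9 + [1] * 5 + [0] * 5
groups = ["A"] * 10 + ["B"] * 10

overall = ece(probs, labels)
per_group = ece_by_subgroup(probs, labels, groups)
print(f"overall ECE: {overall:.2f}")
for group, (value, count) in sorted(per_group.items()):
    print(f"group {group}: ECE {value:.2f} (n={count})")
```

The overall event rate is 6 in 20, exactly matching the 0.3 score, so the pooled ECE is zero even though each subgroup is off by 0.2. Small subgroups give noisy estimates, so the case counts are worth reporting alongside the metric.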

What to Do When You Read a Paper

A short checklist for evaluating calibration in a published AI study:

Is calibration even mentioned? If not, consider whether the output probability could be misinterpreted in the field.
Is there a reliability diagram? If yes, eyeball it. Does the curve hug the diagonal across the relevant probability range?
Was calibration assessed on a held-out set? If they used the same data for calibration and evaluation, the calibration numbers are optimistic.
Are calibration plots reported in subgroups? Calibration can be excellent overall and poor in a particular demographic. This is exactly the kind of thing 7.2 addresses.
What does the paper recommend the threshold be? If they’re recommending a threshold (e.g., “refer if score > 0.7”), is that threshold actually meaningful given the model’s calibration?

Key Takeaways

Calibration and discrimination are independent. A high AUC says nothing about whether the model’s probabilities mean what they say.
Modern neural networks are typically overconfident. Expect to need post-hoc calibration before reporting probabilities to clinicians.
Reliability diagrams are the most useful single tool. Always look at the curve, not just the summary metric.
Calibrate on a held-out set, not your test set. Information can leak from your test set if you use it for calibration.
Calibration drifts in deployment. Monitoring is the only way to catch it before it does damage.
If your model shows clinicians a probability, that probability is a claim. Make sure it’s a claim you can back up.