Comparing Models: When Is One Model Truly Better Than Another?

Statistical tests for comparing classifier performance (McNemar's test, DeLong's test for AUC comparison), and avoiding the pitfall of selecting models based on noise in the test set.

The Question We Actually Want to Answer

You’ve built two models. Maybe one is the new architecture you spent six months on, and the other is the baseline you started with. Maybe they’re your model and a competitor’s published numbers. Maybe they’re the same architecture trained with two different hyperparameter sets.

Model A scores AUC = 0.872 on the test set. Model B scores 0.851.

Is A really better than B? Or did A just happen to do better on this particular test set, while in reality the two models are about the same?

This is a question that clinical researchers answer all the time when comparing treatments. AI papers ask the same question, but a remarkable proportion of them answer it wrong. The math is well-known. The wrong answer is widespread.

The Naive Comparison and Why It Fails

The obvious thing to do is compute confidence intervals for each AUC separately and check whether they overlap. If A’s CI is 0.84-0.90 and B’s is 0.82-0.88, they overlap, so we conclude there’s no significant difference.

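This is wrong, and it’s wrong in an interesting way. Non-overlapping CIs imply a significant difference, but overlapping CIs don’t imply no difference. The reason: the two CIs are computed on the same patients, so the errors of model A and model B are correlated. When both models stumble on the same hard cases, much of that shared noise cancels in the difference, and the variance of the difference is smaller than the two separate CIs suggest. A paired test accounts for that correlation; comparing separate CIs throws it away.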

For binary classification, a paired test can find a significant difference between two models even when their individual CIs overlap heavily. We’ve seen cases where the AUC difference is 0.01 and a properly paired DeLong test rejects the null hypothesis of equal AUCs, because the two models agree on most patients and disagree consistently on a small number where the predicted scores diverge meaningfully.

The takeaway: when you’re comparing two models tested on the same data, use a paired test. When you’re comparing models on different datasets (which you should probably avoid anyway), you’re stuck with the looser unpaired comparison.
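To make the correlation argument concrete, here is a minimal sketch of how the standard error of an AUC difference shrinks as the correlation between the two estimates rises. The numbers are illustrative assumptions, not results: each SE of 0.015 is roughly what a CI of ±0.03 implies, and the correlation values are made up.

```python
import math


def se_of_difference(se_a, se_b, rho):
    """SE of (AUC_A - AUC_B) from each model's SE and their correlation rho."""
    var = se_a ** 2 + se_b ** 2 - 2.0 * rho * se_a * se_b
    return math.sqrt(max(var, 0.0))


# Assumed inputs: SE ~ 0.015 per model, observed difference of 0.02.
se_a = se_b = 0.015
diff = 0.02

for rho in (0.0, 0.5, 0.9):
    se = se_of_difference(se_a, se_b, rho)
    print(f"rho={rho:.1f}  SE(diff)={se:.4f}  z={diff / se:.2f}")
```

With rho = 0 this is the unpaired comparison. As rho approaches 1, the same 0.02 difference is measured far more precisely, which is exactly the information a paired test uses and a CI-overlap check ignores.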

DeLong’s Test: The Standard for AUC Comparisons

If you remember one tool from this article, make it DeLong’s test.

DeLong’s method computes the asymptotic variance of the difference between two correlated AUCs and gives you a p-value (or, equivalently, a confidence interval for the AUC difference). It’s the right test for comparing two models that were evaluated on the same test set. It’s implemented in most statistical packages (R’s pROC and MedCalc, for example). scikit-learn itself does not ship a DeLong test, but third-party Python implementations exist, and the method is short enough to write and verify yourself. There’s no excuse for not using it.

A few practical notes:

It’s a paired test. Both models need to provide predictions for every test case. You can’t apply it if your models were evaluated on different subsets.

It handles ties correctly. Older “naive” approaches don’t, which can matter when many cases have identical scores.

It is defined for ROC AUC. It does not carry over directly to precision-recall curves; for PR comparisons, a paired bootstrap on the difference is the safer choice.

It assumes asymptotic normality, which holds well except in tiny samples. For test sets under 100 cases, prefer bootstrap-based comparisons.

For multi-model comparisons (three or more models on the same test set), you can apply pairwise DeLong tests, but you’ll need to correct for multiple comparisons. Bonferroni is the easy default.
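For readers who want to see what the test actually computes, here is a minimal sketch of the paired DeLong comparison using NumPy. It uses the straightforward pairwise formulation, which needs memory proportional to positives × negatives, so it suits modest test sets; the function names and the synthetic data are assumptions made for illustration. For real work, check the output against an established implementation such as pROC.

```python
from math import erfc, sqrt

import numpy as np


def _auc_components(pos, neg):
    """AUC plus per-positive and per-negative placement values."""
    # psi = 1 if the positive outscores the negative, 0.5 on ties, else 0
    psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    return psi.mean(), psi.mean(axis=1), psi.mean(axis=0)


def delong_paired_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for two AUCs measured on the same cases."""
    y = np.asarray(y_true).astype(bool)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    m, n = int(y.sum()), int((~y).sum())
    if m < 2 or n < 2:
        raise ValueError("need at least two positive and two negative cases")

    auc_a, v10_a, v01_a = _auc_components(a[y], a[~y])
    auc_b, v10_b, v01_b = _auc_components(b[y], b[~y])

    # 2x2 covariance of the two AUC estimates (np.cov uses ddof=1)
    s10 = np.cov(np.vstack([v10_a, v10_b]))
    s01 = np.cov(np.vstack([v01_a, v01_b]))
    s = s10 / m + s01 / n

    var_diff = s[0, 0] + s[1, 1] - 2.0 * s[0, 1]
    if var_diff <= 0:
        raise ValueError("zero variance: the two score vectors rank cases identically")

    diff = auc_a - auc_b
    se = sqrt(var_diff)
    z = diff / se
    return {
        "auc_a": float(auc_a),
        "auc_b": float(auc_b),
        "diff": float(diff),
        "se": se,
        "ci95": (diff - 1.96 * se, diff + 1.96 * se),
        "p_value": erfc(abs(z) / sqrt(2.0)),  # two-sided normal p-value
    }


# Synthetic example: two noisy views of the same underlying signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
latent = y + rng.normal(0.0, 1.0, size=400)
scores_a = latent + rng.normal(0.0, 0.3, size=400)
scores_b = latent + rng.normal(0.0, 0.6, size=400)
print(delong_paired_test(y, scores_a, scores_b))
```

The off-diagonal term of the covariance matrix is what makes the test paired: it is the estimated covariance between the two AUCs, and it is subtracted when forming the variance of the difference.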

McNemar’s Test: For Binary Outcomes

When you’re comparing two models on a binary classification decision at a fixed threshold (rather than on AUC across all thresholds), the right tool is McNemar’s test.

The logic is clinical-trial-familiar. For each patient, model A’s prediction is correct or incorrect. So is model B’s. You get a 2×2 table:

	A correct	A wrong
B correct	a	b
B wrong	c	d

The cells that matter are b and c: cases where the two models disagree. McNemar’s test asks whether the disagreement is symmetric (about the same in both directions) or asymmetric (one model is systematically right when the other is wrong).

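This is the right test for comparing classification accuracy at a fixed threshold. It’s also the right test for comparing sensitivity at a fixed specificity, or specificity at a fixed sensitivity, as long as you’ve pre-specified the operating point: build the table from the disease-positive cases only for sensitivity, or from the disease-negative cases only for specificity.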

A common misuse: applying a chi-squared test on a 2×2 table of “model A correct/wrong” vs “model B correct/wrong” treating them as independent samples. They’re not independent. Use McNemar.
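Here is a minimal sketch of the exact (binomial) form of McNemar’s test, using only the standard library. The function name and the example counts are illustrative assumptions. The cell labels follow the table above: b counts cases where A is wrong and B is correct, c counts the reverse.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar test from per-case correctness flags."""
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must be scored on the same cases")

    # Only the discordant cells carry information about which model is better.
    b = sum(1 for x, y in zip(a_correct, b_correct) if not x and y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree

    # Under the null, each discordant case is a fair coin flip.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Hypothetical test set of 100 cases: 70 both right, 10 both wrong,
# 5 where only B is right, 15 where only A is right.
a_flags = [True] * 70 + [False] * 5 + [True] * 15 + [False] * 10
b_flags = [True] * 70 + [True] * 5 + [False] * 15 + [False] * 10
b, c, p = mcnemar_exact(a_flags, b_flags)
print(f"b={b}, c={c}, p={p:.4f}")
```

Note that the 80 concordant cases do not enter the calculation at all. Two models at 85% and 75% accuracy are compared entirely through the 20 cases where they disagree.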

Comparing Calibration

If you’re claiming your model is better-calibrated than another, the question is harder. There isn’t a single dominant test in the literature. Reasonable options include:

Paired bootstrap on Brier score. Compute Brier for each model on the same test set, take the difference, bootstrap-resample to get a CI on the difference. If the CI excludes zero, you can claim a difference.

Visual comparison of reliability diagrams. Less formal, but often more informative than a single number, especially if the miscalibrations are in different directions or in different probability ranges.

Calibration intercept and slope. Fit a logistic regression of true outcomes against the log-odds of the model’s predictions. A well-calibrated model has intercept = 0 and slope = 1. Compare across models.

This is one of those areas where the formal hypothesis testing is less mature than for discrimination. Be honest about what you can and can’t claim.
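The paired bootstrap on Brier score can be sketched as follows. This is a minimal illustration using a percentile interval; the function name, the number of resamples, and the synthetic data (model B is deliberately miscalibrated by a constant shift) are all assumptions.

```python
import numpy as np


def paired_bootstrap_brier_diff(y_true, prob_a, prob_b, n_boot=2000, seed=0):
    """Brier(A) - Brier(B) with a 95% percentile bootstrap CI."""
    y = np.asarray(y_true, dtype=float)
    pa = np.asarray(prob_a, dtype=float)
    pb = np.asarray(prob_b, dtype=float)

    # Per-case difference in squared error; its mean is Brier(A) - Brier(B).
    d = (pa - y) ** 2 - (pb - y) ** 2

    # Resample cases, not models: each draw keeps A and B paired on the same case.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, d.size, size=(n_boot, d.size))
    boot = d[idx].mean(axis=1)

    lo, hi = np.percentile(boot, [2.5, 97.5])
    return float(d.mean()), (float(lo), float(hi))


# Synthetic example: A reports the true risk, B overstates it by 0.2.
rng = np.random.default_rng(1)
risk = rng.uniform(0.05, 0.95, size=1000)
outcome = rng.random(1000) < risk
prob_a = risk
prob_b = np.clip(risk + 0.2, 0.0, 1.0)

diff, (lo, hi) = paired_bootstrap_brier_diff(outcome, prob_a, prob_b)
print(f"Brier difference (A - B): {diff:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

A negative difference favors model A, since lower Brier is better. Keep in mind that Brier score mixes calibration and discrimination, so a difference here is not evidence about calibration alone.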

Comparing Models on Cross-Validated Results

This gets thornier. When you have K cross-validated performance estimates per model, the temptation is to compute means and standard deviations and run a t-test.

This is wrong, for a subtle reason. The K cross-validation folds are not independent. The same patients appear in different training folds, and the folds were constructed from the same dataset. The naive t-test treats them as if they were independent, which underestimates the variance and makes differences look more significant than they are.

Better options:

Paired bootstrap on the per-fold performance differences.
Repeated K-fold cross-validation (with different random seeds) followed by an appropriately-conservative analysis.
Nadeau-Bengio corrected t-test for repeated K-fold, which adjusts for the dependence between folds.
Pre-registering a single held-out test set and using paired DeLong / McNemar on that. Often the cleanest solution.
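As an illustration of the corrected t-test, here is a minimal sketch. It assumes you already have one performance difference (model A minus model B) per train/test split, and that every split has the same training and test sizes. The function name and example numbers are hypothetical; the correction replaces the naive 1/J variance factor with 1/J + n_test/n_train.

```python
import numpy as np
from scipy import stats


def corrected_resampled_ttest(diffs, n_train, n_test):
    """Nadeau-Bengio corrected t-test on per-split performance differences."""
    d = np.asarray(diffs, dtype=float)
    j = d.size
    if j < 2:
        raise ValueError("need at least two splits")

    var = d.var(ddof=1)
    naive_se = np.sqrt(var / j)
    # Inflate the variance to account for overlapping training sets.
    corrected_se = np.sqrt((1.0 / j + n_test / n_train) * var)

    t_stat = d.mean() / corrected_se
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=j - 1)
    return {
        "mean_diff": float(d.mean()),
        "naive_se": float(naive_se),
        "corrected_se": float(corrected_se),
        "t": float(t_stat),
        "p_value": float(p_value),
    }


# Hypothetical AUC differences from 10-fold CV on 1,000 cases (900 train / 100 test).
diffs = [0.020, 0.010, 0.030, 0.015, 0.025, 0.020, 0.010, 0.030, 0.015, 0.025]
print(corrected_resampled_ttest(diffs, n_train=900, n_test=100))
```

For 10-fold cross-validation the corrected standard error is about 1.45 times the naive one, which is often enough to turn an apparently significant difference into an inconclusive one.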

For a regulatory submission, the FDA generally expects performance comparisons on the held-out test set, not on cross-validation. Cross-validated comparisons can be useful during development, but the formal claim should come from a single held-out test set.

The Test Set Selection Problem

A model that does best on a particular test set isn’t necessarily the best model. It’s the model that happened to do best on that test set. If you tried 50 models and picked the one with the highest AUC on the test set, the AUC of the chosen model is biased upward. The expected gap between “best in your hand” and “actual best” grows with how many models you tried.

This is the same multiple-comparisons phenomenon as in subgroup analysis, just applied to model selection. The corrections are similar:

Don’t use the test set for model selection. Use a validation set. The test set is for one final measurement.
Pre-specify the model architecture and hyperparameters before touching the test set. If you find yourself running multiple architectures on the test set “to see,” you’ve contaminated the evaluation.
If you must compare many models, use a Bonferroni or similar correction.
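A small simulation shows how large the selection bias can get. This sketch makes two loud simplifying assumptions: every candidate model has exactly the same true AUC, and each model’s test-set estimate carries independent Gaussian noise with a standard error of 0.015. Real models scored on the same test set have correlated errors, which shrinks the effect, so treat the output as an upper-end illustration.

```python
import numpy as np


def expected_selection_bias(n_models, se=0.015, n_sim=5000, seed=0):
    """Average amount by which the best-of-n observed AUC exceeds the true AUC."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated study: n_models noisy estimates of the same AUC.
    noise = rng.normal(0.0, se, size=(n_sim, n_models))
    # Picking the winner on the test set means reporting the row maximum.
    return float(noise.max(axis=1).mean())


for k in (1, 5, 50):
    print(f"{k:>2} models tried -> expected inflation {expected_selection_bias(k):.4f}")
```

With one model there is no inflation. With 50 equally good models, the winner’s reported AUC is inflated by roughly two standard errors purely through selection.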

In a regulatory context, the FDA expects the device to be locked before primary evaluation. You can’t submit a 510(k) for “whichever model we choose later.” The locked-device principle is the formal version of “don’t use the test set for selection.” (See GMLP Principle 2 on the software engineering side of this.)

A Word on Effect Sizes

Statistical significance is necessary but not sufficient. With a large enough test set, almost any difference becomes statistically significant. A paired DeLong test on 50,000 patients might confidently reject the null hypothesis when the actual AUC difference is 0.002. Statistically significant, clinically irrelevant.

For clinical AI work, report the effect size with a CI, not just a p-value. “Model A’s AUC was higher than model B’s by 0.018 (95% CI: 0.005-0.031, p < 0.01)” is a much better summary than “model A was significantly better than model B (p < 0.01).” The reader can then judge whether 0.018 means anything clinically.
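A reporting helper along these lines keeps the effect size in front of the reader. This is a sketch under a normal approximation; the function names are hypothetical, and the standard error of 0.00663 is back-calculated from the CI in the example sentence above rather than taken from any real study.

```python
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return erfc(abs(z) / sqrt(2.0))


def summarize_difference(diff, se):
    """Format an AUC difference as effect size, 95% CI, and p-value."""
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    p = two_sided_p(diff / se)
    return f"{diff:.3f} (95% CI: {lo:.3f} to {hi:.3f}, p = {p:.3f})"


# The example from the text, with an SE implied by its CI.
print("AUC difference:", summarize_difference(0.018, 0.00663))

# A tiny difference measured very precisely: significant, but still tiny.
print("AUC difference:", summarize_difference(0.002, 0.0005))
```

The second line is the large-test-set scenario: the p-value is very small, but the interval makes it obvious that the whole plausible range of the difference is near 0.002.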

A useful rule of thumb: in most clinical AI applications, AUC differences below 0.02 are unlikely to translate into meaningful differences in clinical outcomes. They might still be worth pursuing if other factors (cost, speed, interpretability) favor one model, but the AUC difference alone shouldn’t carry the argument.

Comparing AI Against Clinicians

A specific case of “comparing models” is the AI-vs-clinician comparison that drives many published headlines.

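The right design here is the reader study, where a panel of clinicians and the AI both read the same cases, and you compare performance directly. This is a paired-comparison setup (same cases, multiple readers), and the statistical machinery is similar to what we’ve been describing, with one extra wrinkle: readers are a source of variance as well as cases, which is what multi-reader multi-case (MRMC) methods are designed to handle.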

A few specific issues:

The “AI beats radiologists” trap. This claim has been made many, many times, and most of the time it turns out to be misleading. Common reasons: the radiologists were operating without the full clinical context they’d have in practice; the test set was enriched with hard cases; the AI was tuned on the same kind of data the test set came from while the radiologists were generic. Be very skeptical of these claims unless the study design is genuinely fair.

The right comparator is usually AI-assisted clinician, not AI vs. clinician. GMLP Principle 7 (human-AI team performance) frames this directly. The clinical question is not “does the AI outperform the clinician?” but “do clinicians using the AI outperform clinicians without it?”

Inter-reader variability is a real complication. Different radiologists disagree on the same cases. The AI is a single decision-maker; the radiologists are a distribution. Comparing a point against a distribution requires care.

Comparing Against Published Results

This one’s always treacherous. You read that competitor X published an AUC of 0.91 on a different test set. Your model gets 0.89 on yours. Are you behind?

Honestly: you don’t know. The two AUCs were computed on different patients, different prevalence, different scanners, different reference standards. The comparison is meaningless in any formal sense, even if it’s all anyone wants to talk about.

What to do:

If you can get the competitor’s model and run it on your test set, do that. Now you have a paired comparison and the formal tools apply.
If you can publish your model evaluated on the same benchmark dataset they used, do that. Public benchmarks are valuable for exactly this reason.
If neither is possible, acknowledge that the comparison is informal. Don’t claim superiority based on numbers from different evaluations.

For regulatory purposes, the FDA is uninterested in informal cross-paper comparisons. The submission needs to stand on its own evidence.

A Pre-Specification Checklist for Model Comparison

If you’re designing a study where you’ll compare two models (your model vs. baseline, your model vs. competitor, your model vs. previous version), pre-specify:

The primary endpoint. AUC? Sensitivity at a fixed specificity? PPV in a defined population? A single metric, decided in advance.
The statistical test. DeLong for AUC, McNemar for binary outcomes, bootstrap for unusual metrics.
The test set. Locked, held out, ideally externally collected.
The hypothesis. Superiority? Non-inferiority with margin X? Equivalence within ±Y?
Subgroup comparisons. Which subgroups, with what corrections for multiple comparisons.
The operating threshold(s). If you’re comparing sensitivity/specificity, where are those measured?

The CONSORT-AI guidelines cover the broader study design. The list above is the minimum statistical pre-specification.
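One lightweight way to hold yourself to the checklist is to write the plan down as an immutable record and commit it before the test set is unlocked. The sketch below is purely illustrative: the class, its field names, and every example value are hypothetical, not a regulatory template.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ComparisonPlan:
    primary_endpoint: str            # e.g. "auc" or "sensitivity_at_fixed_specificity"
    statistical_test: str            # e.g. "delong", "mcnemar", "paired_bootstrap"
    test_set_id: str                 # identifier of the locked, held-out test set
    hypothesis: str                  # "superiority", "non_inferiority", or "equivalence"
    margin: Optional[float] = None   # required for non-inferiority / equivalence
    subgroups: Tuple[str, ...] = ()  # pre-specified subgroup comparisons
    multiplicity_correction: str = "bonferroni"
    operating_threshold: Optional[float] = None

    def __post_init__(self):
        needs_margin = self.hypothesis in ("non_inferiority", "equivalence")
        if needs_margin and self.margin is None:
            raise ValueError("this hypothesis needs a pre-specified margin")


# Hypothetical plan for a new model versus the previous version.
plan = ComparisonPlan(
    primary_endpoint="auc",
    statistical_test="delong",
    test_set_id="external-holdout-v1",
    hypothesis="non_inferiority",
    margin=0.02,
    subgroups=("age_over_65", "scanner_vendor"),
)
print(plan)
```

The point of freezing the record is social rather than technical: a plan that lives in version control with a timestamp before the evaluation run is much harder to quietly revise afterward.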

When the Differences Are Real

Sometimes, after all this care, model A really is better than model B. The difference is statistically significant, the effect size is clinically meaningful, the comparison was paired and pre-specified, the test set was held out and representative.

The right thing to do then is: take the win, document the comparison carefully, and move on. Resist the temptation to also report that model A was better in subgroups you didn’t pre-specify, or at thresholds you didn’t pre-register. That’s where good comparisons turn into bad papers.

Key Takeaways

Use paired tests for paired data. Same test set, two models = paired test. DeLong for AUC, McNemar for binary outcomes.
Non-overlapping CIs imply a difference; overlapping CIs do not imply no difference. Don’t use CI overlap as your test.
Statistical significance is not the same as clinical significance. Report effect sizes with CIs.
The test set is for one measurement. Don’t use it for model selection.
Pre-specify the comparison. Endpoint, test, threshold, hypothesis, all before touching the test set.
For AI-vs-clinician, the right comparison is usually AI-assisted clinician vs. unassisted clinician. See GMLP Principle 7.
Cross-paper comparisons are informal. They’re useful for orientation, not for claims of superiority.