Statistical Significance, Confidence Intervals, and Sample Size for AI Studies

Applying rigorous statistical methodology to AI evaluation. Confidence intervals for AUC, bootstrapping, multiple comparisons, and sample size calculations for performance studies.

Why Statistical Rigor Is Slipping in AI Papers

Clinical researchers generally take confidence intervals for granted. You wouldn’t dream of reporting a treatment effect size without one: a hazard ratio of 1.4 with a 95% CI of 0.9 to 2.1 tells a completely different story than 1.4 with a CI of 1.3 to 1.5, even though the point estimate is identical. The CI is the part of the result that determines whether to believe anything else.

In the AI literature, this has been slipping. You’ll routinely see papers report “AUC = 0.91” with no CI at all. For example, Nagendran et al., BMJ 2020 (“Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies”) reviewed 81 deep-learning vs. clinician studies and found pervasive methodological weaknesses, including limited reporting of uncertainty. Other similar studies are a Google Scholar search away.

The good news is that the statistical tools you already know mostly carry over. Sensitivity is a proportion, specificity is a proportion, and AUC is a U-statistic with well-understood asymptotic properties. We don’t need to invent new statistics for AI evaluation; we just need to actually use the ones we have.

Confidence Intervals for the Standard Metrics

A quick tour of metrics that should include confidence intervals. We’re skipping the formulas because clinical researchers will recognize the methods by name, and the details belong in a stats textbook.

Sensitivity, Specificity, PPV, NPV

These are proportions, so standard binomial confidence intervals apply.

The Wald interval works fine when the proportion is around 0.5 and the sample size is large. It misbehaves badly at the extremes (sensitivities close to 1.0, for instance) and can produce intervals that include impossible values. For clinical AI work, prefer the Wilson score interval or the Clopper-Pearson interval. Both are well-behaved at the extremes. Most statistical software offers both, but defaults vary, so check which interval yours actually computes.

A practical note: A lot of test sets have sparse positives, and confidence intervals on sensitivity get wide fast when positives are rare. A sensitivity of 95% (19/20) has a 95% CI of roughly 75% to 99%. That is much less impressive than the point estimate suggests. If your test set has only 20 positive cases, no amount of careful design will give you a tight estimate of sensitivity. You need more positives.
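To make the 19/20 example concrete, here is a minimal sketch using only the Python standard library. The function names `wald_interval` and `wilson_interval` are illustrative, not from any particular package. It shows the Wald interval spilling past 1.0 while the Wilson interval stays in range.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval; can spill outside [0, 1]."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval; always stays inside [0, 1]."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half


# 19 of 20 positives detected
wald_lo, wald_hi = wald_interval(19, 20)
wilson_lo, wilson_hi = wilson_interval(19, 20)
print(f"Wald:   {wald_lo:.3f} to {wald_hi:.3f}")      # upper limit exceeds 1.0
print(f"Wilson: {wilson_lo:.3f} to {wilson_hi:.3f}")  # about 0.76 to 0.99
```

Clopper-Pearson gives a slightly wider interval for the same data; it needs the beta distribution, so in practice you would take it from a statistics package rather than hand-roll it.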

AUC

The standard method for a CI on AUC is DeLong’s method, which gives an asymptotically correct interval and is implemented in essentially every statistical package. For smaller samples or unusual situations, bootstrap confidence intervals (more on this in a moment) are a good alternative.
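For readers who want to see what DeLong’s method computes for a single AUC, the following is a small sketch under stated assumptions: it uses the simple O(m × n) pairwise form rather than the faster rank-based algorithm, the scores are made up for illustration, and clipping the interval to [0, 1] is a pragmatic choice here rather than part of the method. For real analyses, use a vetted implementation from your statistics package.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def delong_auc_ci(pos_scores, neg_scores, confidence=0.95):
    """AUC with a DeLong confidence interval for a single model."""

    def kernel(x, y):
        # 1 if the positive outranks the negative, 0.5 for ties, else 0
        return 1.0 if x > y else 0.5 if x == y else 0.0

    # Placement values: each case's average "win rate" against the other class
    v_pos = [mean(kernel(x, y) for y in neg_scores) for x in pos_scores]
    v_neg = [mean(kernel(x, y) for x in pos_scores) for y in neg_scores]

    auc = mean(v_pos)
    # DeLong variance: sample variances of the placements, scaled by group size
    var = variance(v_pos) / len(pos_scores) + variance(v_neg) / len(neg_scores)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half = z * sqrt(var)
    return auc, max(0.0, auc - half), min(1.0, auc + half)


# Hypothetical model scores for 6 diseased and 6 healthy cases
pos = [0.90, 0.80, 0.75, 0.60, 0.55, 0.40]
neg = [0.70, 0.50, 0.45, 0.30, 0.20, 0.10]
auc, lo, hi = delong_auc_ci(pos, neg)
print(f"AUC = {auc:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
```

With only six cases per class the interval is very wide, which is the point: the asymptotic approximation is least trustworthy exactly where samples are small.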

For two AUCs being compared on the same test set (the same patients, evaluated by both models), DeLong’s paired test is usually the right tool. See 4.7 Comparing Models for how this works and why “naive” comparisons of two AUCs are misleading.

Dice and IoU

Segmentation metrics don’t have closed-form confidence intervals in most cases. Bootstrap is the default approach: resample cases, recompute the summary (typically the mean or median per-case Dice score) on each resample, and report a 95% CI from that bootstrap distribution. This works for IoU, Hausdorff distance, volume errors, and most other segmentation summaries.

Calibration Metrics

ECE, Brier score, and calibration slope/intercept all need bootstrap intervals in practice since they’re not metrics with neat asymptotic distributions.

The Bootstrap, in About Three Paragraphs

If you’re a clinical researcher who hasn’t used the bootstrap much, it’s worth learning: it’s one of the most useful tools in applied statistics, especially in medical AI.

The idea is that instead of deriving the sampling distribution of your metric mathematically (which can be impractical or impossible for complicated metrics like Dice or calibration error), you simulate it by resampling. Specifically, you a) take your test set of N cases, b) draw N cases with replacement, and c) compute your metric on the resampled set. Repeat 1,000 or 10,000 times. The distribution of those resampled metric values is a Monte Carlo estimate of the sampling distribution of your metric. The 2.5th and 97.5th percentiles give you a 95% CI.
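The resampling recipe above can be sketched in a few lines of standard-library Python. The function name `bootstrap_ci`, the simple index-based percentile rule, and the Dice values are all illustrative assumptions; statistics packages offer refinements such as bias-corrected intervals.

```python
import random
from statistics import mean


def bootstrap_ci(cases, metric, n_boot=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for any metric computed on a list of cases."""
    rng = random.Random(seed)          # fixed seed so the result is reproducible
    n = len(cases)
    stats = []
    for _ in range(n_boot):
        resample = rng.choices(cases, k=n)   # draw N cases with replacement
        stats.append(metric(resample))       # recompute the metric
    stats.sort()
    alpha = 1 - confidence
    lo = stats[int((alpha / 2) * n_boot)]             # ~2.5th percentile
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]     # ~97.5th percentile
    return metric(cases), lo, hi


# Hypothetical per-case Dice scores from a segmentation test set
dice = [0.91, 0.88, 0.79, 0.93, 0.85, 0.62, 0.90, 0.87, 0.94, 0.81, 0.89, 0.76]
point, lo, hi = bootstrap_ci(dice, mean)
print(f"mean Dice = {point:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
```

Because `metric` is just a function of a list of cases, the same helper works for a median, a Brier score, or anything else you can compute per resample.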

This works for essentially any metric you can compute from your test set. It’s computationally heavier than closed-form methods but trivial on modern hardware.

A subtle point is that when your test set has multiple measurements from the same patient (multiple images, multiple time points, multiple slices), the resampling should happen at the patient level, not the image level. Otherwise the CIs will be artificially tight, because the resampled “test sets” treat correlated images as if they were independent. This is the same independence concern that drives the cross-validation discussion in 4.6.
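A sketch of patient-level resampling, assuming each record is a hypothetical `(patient_id, case)` pair: whole patients are drawn with replacement and all of their images come along. The toy data (10 patients, 5 perfectly correlated images each) is invented to make the effect visible.

```python
import random
from collections import defaultdict


def patient_bootstrap_ci(records, metric, n_boot=2000, confidence=0.95, seed=0):
    """records: list of (patient_id, case) tuples; resamples whole patients."""
    by_patient = defaultdict(list)
    for patient_id, case in records:
        by_patient[patient_id].append(case)
    patients = list(by_patient)

    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        drawn = rng.choices(patients, k=len(patients))   # resample patients
        resample = [case for pid in drawn for case in by_patient[pid]]
        stats.append(metric(resample))
    stats.sort()
    alpha = 1 - confidence
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


def accuracy(cases):
    return sum(cases) / len(cases)


# 10 patients x 5 images; images from one patient share the same outcome (1 = correct)
records = [(pid, 1 if pid < 7 else 0) for pid in range(10) for _ in range(5)]

patient_lo, patient_hi = patient_bootstrap_ci(records, accuracy)
# Naive image-level bootstrap: pretend every image is its own patient
naive_records = [(i, case) for i, (_, case) in enumerate(records)]
naive_lo, naive_hi = patient_bootstrap_ci(naive_records, accuracy)

print(f"patient-level: {patient_lo:.2f} to {patient_hi:.2f}")
print(f"image-level:   {naive_lo:.2f} to {naive_hi:.2f}")  # misleadingly narrow
```

The image-level interval is narrower only because it treats 50 correlated images as 50 independent observations; the effective sample size here is 10.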

What “Statistically Significant” Means in This Context

Most AI papers don’t actually do hypothesis testing in the classical sense. They report point estimates and (sometimes, hopefully usually) confidence intervals. When statistical tests do appear, they’re usually about comparing models (4.7) or comparing the AI against a clinical reference (the reader study setup of 5.5).

There are some cases where statistical significance is meaningful, however.

Is the model better than chance? A coin-flip baseline has AUC = 0.5. If your model’s AUC is 0.55 with a CI of 0.49 to 0.61, you can’t claim it’s better than random. This sounds silly, but this bar exists and a remarkable number of small-dataset papers don’t actually clear it.

Is the model better than an existing standard? If existing risk scores achieve AUC = 0.72, and your AI achieves 0.75 with a CI of 0.71 to 0.79, you don’t actually have evidence of improvement, since your interval comfortably includes 0.72. (Note that a CI that includes or overlaps the comparator is not the same as a paired statistical test failing; see 4.7. But it’s a warning sign.)

Is performance equivalent across subgroups? This is a different question than “is the difference statistically significant?”. In subgroup analyses you often want to claim equivalence, which requires equivalence testing (showing the difference is within a pre-specified clinically acceptable range), not the more common superiority testing.

Is the model non-inferior to a comparator? This sounds a bit weird compared to showing your model is better, but it’s common in regulatory contexts. You’re not claiming you’re better than the predicate; you’re claiming you’re not worse, within a defined margin. Non-inferiority testing requires pre-specified margins and a pre-defined endpoint.
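The difference between superiority, non-inferiority, and equivalence is easiest to see as decision rules on the confidence interval for the difference (new model minus comparator, where higher is better). This is a sketch only: the function name and numbers are hypothetical, and the confidence level has to match the test you pre-specified. A two-sided 95% CI lower bound corresponds to a one-sided 2.5% non-inferiority test, while equivalence by two one-sided tests at the 5% level conventionally uses a 90% CI.

```python
def interpret_difference(ci_low, ci_high, margin):
    """Classify a CI for (new model - comparator), where higher is better.

    margin is the pre-specified, clinically acceptable difference (> 0).
    """
    return {
        "superior": ci_low > 0,                                  # whole CI above zero
        "non_inferior": ci_low > -margin,                        # not worse by more than margin
        "equivalent": ci_low > -margin and ci_high < margin,     # whole CI inside +/- margin
    }


# Hypothetical: sensitivity difference of -0.01 with CI (-0.04, 0.02), margin 0.05
close_call = interpret_difference(-0.04, 0.02, margin=0.05)
# Same point estimate but a wider CI from a smaller study
too_wide = interpret_difference(-0.08, 0.06, margin=0.05)

print(close_call)  # non-inferior and equivalent, but not superior
print(too_wide)    # none of the three claims is supported
```

Note that the second study fails to show non-inferiority purely because of imprecision, which is why the margin and the sample size have to be planned together.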

A frustrating observation is that many AI papers use language like “the model significantly outperformed clinicians” without an actual statistical test, or use a t-test on a setup that violates t-test assumptions, or compare two AUCs without a paired test. As a reviewer, you can and should call this out, and as a writer, just don’t do it.

Sample Size

We all know that more data is better, and in the age of LLMs we’re getting used to models being trained on petabytes of data. That isn’t usually possible in medical AI, but it’s still a good idea to do some calculations up front to figure out how much data you need for meaningful results.

This is another process that’s familiar to clinicians. In a trial, you pre-specify the effect size you want to detect, the alpha and power you’ll accept, and you compute the sample size. Then you collect that many patients. AI evaluation should follow the same logic. Ask in advance how many cases you need in your test set to estimate sensitivity (or specificity, or AUC) with the precision you want.

Rules of Thumb for Sensitivity and Specificity

For a proportion like sensitivity, the half-width of a 95% CI is roughly 2 × √(p × (1-p) / N), where p is the sensitivity and N is the number of positive cases. So if you want to estimate a sensitivity of 0.90 to within ±0.03 (a 95% CI of total width 0.06), you need roughly N = 4 × 0.9 × 0.1 / 0.03² ≈ 400 positive cases.

That’s a bit daunting: four hundred positives, not four hundred total. If your disease prevalence is 5% and you’re sampling consecutively, that means 8,000 total cases. Sensitivity is estimated from positive cases and specificity from negative cases, so if positives are rare, getting tight CIs on sensitivity is expensive.
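The arithmetic above is easy to wrap in a helper so you can try different targets. This sketch uses the exact normal quantile (1.96) rather than the rounded value of 2, so it lands slightly below the rule-of-thumb figures; the function names are illustrative, and the Wald-style formula is itself an approximation that is least reliable near 0 or 1.

```python
from math import ceil
from statistics import NormalDist


def positives_needed(expected_sens, half_width, confidence=0.95):
    """Positives needed for a Wald-style CI of +/- half_width around expected_sens."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z**2 * expected_sens * (1 - expected_sens) / half_width**2)


def total_cases_needed(n_positives, prevalence):
    """Consecutive cases needed to expect n_positives at the given prevalence."""
    return ceil(n_positives / prevalence)


n_pos = positives_needed(0.90, 0.03)         # 385 with z = 1.96 (about 400 with z = 2)
n_total = total_cases_needed(n_pos, 0.05)    # roughly 7,700 at 5% prevalence
print(n_pos, n_total)

# Halving the target half-width roughly quadruples the requirement
print(positives_needed(0.90, 0.015))
```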

A common workaround is to enrich the test set with positives by sampling non-consecutively to oversample positive cases. This gives you good sensitivity estimates, but at the cost of distorting the prevalence so that PPV is no longer interpretable at face value. (You’d report sensitivity and specificity on the enriched set, and then translate to PPV/NPV using a known target prevalence.) This is fine and standard; it just needs to be done transparently.
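Translating sensitivity and specificity to PPV/NPV at a target prevalence is a direct application of Bayes’ rule. A minimal sketch, with illustrative numbers (sensitivity 0.90, specificity 0.95) that are not taken from any study:

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """PPV and NPV implied by sensitivity/specificity at a given prevalence."""
    tp = sensitivity * prevalence                # true positives per unit population
    fp = (1 - specificity) * (1 - prevalence)    # false positives
    tn = specificity * (1 - prevalence)          # true negatives
    fn = (1 - sensitivity) * prevalence          # false negatives
    return tp / (tp + fp), tn / (tn + fn)


# Face-value PPV in a test set enriched to 50% positives
ppv_enriched, npv_enriched = ppv_npv(0.90, 0.95, prevalence=0.50)
# PPV at a realistic 5% deployment prevalence
ppv_target, npv_target = ppv_npv(0.90, 0.95, prevalence=0.05)

print(f"enriched set: PPV = {ppv_enriched:.2f}, NPV = {npv_enriched:.2f}")
print(f"5% prevalence: PPV = {ppv_target:.2f}, NPV = {npv_target:.2f}")
```

The same model goes from a PPV of about 0.95 on the enriched set to under 0.5 at 5% prevalence, which is why the target prevalence has to be stated explicitly.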

Sample Size for AUC

For an AUC estimate, the asymptotic standard error depends on the AUC value and on the case-to-control ratio. As a reference point, under the Hanley-McNeil approximation, estimating an AUC of 0.85 to within ±0.03 (95% CI width of 0.06) takes roughly 300 positives and a similar number of negatives; 100-200 per group gets you closer to ±0.04 to ±0.05. Most statistical packages can compute this directly given your assumptions.
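One way to get a planning number is the Hanley-McNeil standard error formula. This is an assumption on my part, since the text doesn’t name a specific method; other variance models (binormal, for instance) give somewhat different answers, so treat the output as a rough guide. The helper names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley-McNeil (1982) approximate standard error of an AUC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return sqrt(var)


def cases_per_group(auc, half_width, confidence=0.95, max_n=100_000):
    """Smallest equal group size whose CI half-width meets the target."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    for n in range(2, max_n + 1):
        if z * hanley_mcneil_se(auc, n, n) <= half_width:
            return n
    raise ValueError("target precision not reached within max_n")


n_needed = cases_per_group(0.85, 0.03)
half_width_at_200 = 1.96 * hanley_mcneil_se(0.85, 200, 200)
print(n_needed)                      # a little over 300 per group
print(round(half_width_at_200, 3))   # precision you get with 200 per group
```

Under this approximation the ±0.03 target needs a little over 300 cases per group, and 200 per group yields a half-width just under ±0.04.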

A Note on Test Set Independence

Everything in this article assumes that your test set is independent of your training process. If you tuned hyperparameters on the test set, peeked at the test set during development, or selected your model based on test set performance, the CIs computed on that test set are essentially fiction.

This is the data leakage problem, and it’s also covered in GMLP Principle 4. The statistical machinery is only meaningful if the test set was truly held out.

What to Pre-Register

Borrowing from clinical trials, the most useful thing you can do for the rigor of your AI evaluation is to pre-register the analysis plan before touching the test set.

A minimal pre-registration includes:

Primary endpoint (e.g., sensitivity at a fixed specificity of 0.95)
Pre-specified subgroup analyses (with adequate power calculations)
The statistical test you’ll use to compare against the comparator
The threshold or operating point selection method, decided before seeing the test set
The hypothesis you’re testing (superiority? non-inferiority? equivalence?), with margins where relevant

This is a bit bureaucratic, but it’s how you protect yourself against the soft kind of data dredging that affects almost every AI paper: trying multiple operating points, multiple metrics, multiple subgroups, and reporting the one that looked best. Pre-registration forces you to commit up front and protects you against subconscious mistakes in the future. The CONSORT-AI and TRIPOD+AI reporting guidelines call for many of these elements.

What This Looks Like in a Regulatory Submission

The FDA expects:

Confidence intervals on every reported performance metric.
A pre-specified test set, ideally with external validation.
Sample size justification.
Subgroup analyses with appropriate precision and (where statistical comparison is meaningful) correction for multiple comparisons.
Statistical comparison against the predicate or clinical reference, where the device’s claim depends on it.
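For the multiple-comparison correction mentioned in the list above, Holm’s step-down procedure is one common choice. The text doesn’t prescribe a specific method, so this is an illustrative sketch, and the subgroup names and p-values are invented.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)          # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from four subgroup comparisons
subgroups = ["age >= 65", "female", "site B", "scanner X"]
raw_p = [0.010, 0.040, 0.030, 0.005]
adj_p = holm_adjust(raw_p)

for name, raw, adj in zip(subgroups, raw_p, adj_p):
    verdict = "significant" if adj <= 0.05 else "not significant"
    print(f"{name:10s} raw p = {raw:.3f}  Holm p = {adj:.3f}  {verdict}")
```

All four raw p-values are below 0.05, but only two survive the correction, which is the kind of difference a reviewer will look for.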

Key Takeaways

Every metric in section 4 should be reported with a 95% confidence interval. A point estimate alone is incomplete.
For sensitivity and specificity, the precision depends on the number of positives and negatives, not total cases. Rare conditions need huge test sets.
Bootstrap is the right CI method for complicated metrics (Dice, calibration error, and AUC when samples are small) and works for essentially anything you can compute.
DeLong’s method is the standard for AUC CIs and AUC comparisons. Implementation is built into every major stats package.
Multiple comparisons matter. Subgroup analyses and model selection both inflate apparent performance if not corrected for or pre-registered.
Sample size should be calculated before, not justified after. Pre-specify the test set and the analysis plan.
Resample at the patient level, not the image level, when the same patient contributes multiple measurements. Otherwise CIs are artificially tight.