Cross-Validation Strategies for Medical Data

K-fold, stratified, grouped (by patient, by site), and leave-one-site-out cross-validation. Choosing the right strategy to avoid inflated performance estimates from data leakage.

Why We Cross-Validate in the First Place

You have, let’s say, 2,000 cases. You want to know how well your model will perform on patients it has never seen. The obvious thing is to set aside 400 cases as a test set, train on the remaining 1,600, and report performance on the 400.

This works, but it leaves money on the table. You only got one estimate of performance, and that estimate depended on which 400 cases happened to land in your test set. Worse, you only trained on 1,600 cases, which might be fewer than you really need.

Cross-validation is a way to get more performance information out of the same data. The core idea: split your data into K folds, train on K-1 of them, test on the remaining one, and rotate. Now you have K performance estimates instead of one. Average them and you have a more stable view of how your model performs.
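The rotation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a replacement for a library splitter; the function name k_fold_indices and the seed are assumptions made for the example, and the 2,000-case size is taken from the scenario above.

```python
import random


def k_fold_indices(n_cases, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled K-fold split."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Training set = every fold except the one currently held out.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_cases=2000, k=5))
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

With 2,000 cases and K=5, every rotation trains on 1,600 cases and tests on 400, and each case is tested exactly once across the five rotations.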

K=5 and K=10 are the common defaults. Leave-one-out cross-validation (where K equals the number of cases) is occasionally used for tiny datasets, but it’s noisy and slow.

This is all standard. Where it gets interesting (and where medical AI keeps getting burned) is in how the folds get constructed.

The Naive K-Fold: Why It’s Almost Always Wrong

Naive K-fold, for example scikit-learn’s KFold with shuffle=True (by default it does not shuffle and simply cuts the data into consecutive blocks), does the following: shuffle the cases, then split into K roughly equal groups. Each case goes into exactly one fold. Train on K-1 folds, test on the remaining one.

This works for independent, identically distributed data. Medical data is almost never that.

The problems come from structure in your data that the naive split ignores:

Multiple images per patient. If a patient contributes 8 chest X-rays and you randomly assign each image to a fold, the same patient will appear in both training and testing folds. The model “learns” what this patient looks like from training images and is then tested on more images of the same patient. Performance estimates are inflated.

Multiple slices per scan. If you’re working with CT or MRI volumes and you treat each axial slice as a separate sample, naive splitting leaks information across folds in the same way.

Multiple acquisitions per patient. A patient might have a CT in 2022 and a follow-up CT in 2023. If the two end up in different folds, you’re training on one scan and testing on another scan of the same patient.

Site-level structure. Patients from the same hospital often share characteristics that don’t generalize across sites: scanner model, imaging protocols, demographic distribution, documentation patterns. Naive splitting that mixes sites across folds tells you nothing about how well the model generalizes to a new site.

Temporal structure. If you train and test on data from the same period (say, both from 2018), you’re not testing the kind of distribution shift the model will actually face when deployed in 2025.

The fix in each case is to do structured cross-validation that respects the dependence in the data.
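The patient-level leak is easy to make visible. The sketch below assumes a hypothetical toy dataset of 100 patients with 8 images each, and uses scikit-learn's KFold with shuffling as the naive splitter; it simply counts how many test-fold patients also appear in the training folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy dataset: 100 patients with 8 images each.
patient_ids = np.repeat(np.arange(100), 8)

naive = KFold(n_splits=5, shuffle=True, random_state=0)

leaky_counts = []
for train_idx, test_idx in naive.split(patient_ids):
    train_patients = set(patient_ids[train_idx])
    test_patients = set(patient_ids[test_idx])
    # Patients whose images sit on both sides of the split.
    leaky_counts.append(len(train_patients & test_patients))
```

In a split like this, a test patient escapes overlap only if all 8 of their images happen to land in the same fold, so nearly every test patient has also been seen in training.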

Stratified K-Fold: The Easy Improvement

For classification problems with class imbalance (i.e., most clinical problems), stratified K-fold ensures that each fold has roughly the same class balance as the overall dataset.

If your overall dataset is 8% positive and you have K=5 folds, each fold should contain about 8% positives. Without stratification, the random shuffle can occasionally produce a fold with 4% positives or 12% positives, and the resulting metrics will be noisier than they need to be.

Stratification is essentially free and almost always a good idea. Most modern cross-validation tools default to it for classification problems. If yours doesn’t, change the default.
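As a minimal sketch of stratification, the example below uses scikit-learn's StratifiedKFold on hypothetical labels with the 8% positive rate from the example above. The dataset size (1,000 cases) and the placeholder feature matrix are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 1,000 cases at an 8% positive rate.
y = np.array([1] * 80 + [0] * 920)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Fraction of positives in each held-out fold.
positive_rates = [float(y[test_idx].mean()) for _, test_idx in skf.split(X, y)]
```

Each held-out fold ends up at, or very close to, the overall 8% positive rate, which is the property stratification is meant to guarantee.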

Grouped K-Fold: Cross-Validation by Patient

This is the single most important variant for medical data, and the single most commonly overlooked.

In grouped K-fold, every case is associated with a group ID (typically a patient ID), and the splitting is done at the group level. All cases from the same patient go into the same fold. No patient appears in both training and testing folds.

This is what you need whenever you have multiple measurements per patient, which is essentially all of imaging, all of longitudinal EHR data, and most of clinical NLP. Without grouped K-fold, your reported performance is contaminated by patient-level overlap and your “validation” is partly memorization.

A practical example: a dermatology model trained on 5,000 images from 1,000 patients (average 5 images per patient). If you do naive K-fold and the model partly learns “what each patient’s skin looks like,” performance on the held-out fold will be excellent, because the patient was seen during training. The model has not learned dermatology. It has learned patient identity.

Switch to grouped K-fold (patients, not images, distributed across folds), and the model is now tested on patients it has never seen. Performance often drops 5-15% in our experience. That drop is real. The naive K-fold was lying.
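A grouped split along these lines might look like the following, using scikit-learn's GroupKFold. The data are a toy stand-in with the same shape as the dermatology example (1,000 patients, 5 images each), and the feature matrix is a placeholder; the sketch only verifies the split, it does not reproduce the performance drop.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in for the dermatology example: 1,000 patients, 5 images each.
patient_ids = np.repeat(np.arange(1000), 5)
X = np.zeros((len(patient_ids), 1))  # placeholder features

gkf = GroupKFold(n_splits=5)

overlaps = []
for train_idx, test_idx in gkf.split(X, groups=patient_ids):
    # Patients present on both sides of the split; should always be empty.
    shared = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
    overlaps.append(len(shared))
```

The only change from a naive split is the groups argument, which is why a check like the one on overlaps is cheap to add to any evaluation script.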

GMLP Principle 4 addresses this directly: training and test data must be independent, and “independent” means independent at the relevant level of clinical grouping, not just at the level of individual files.

Stratified Grouped K-Fold: Combining Both

You can have both stratification (by class) and grouping (by patient) at the same time. Many packages support this directly (scikit-learn, for instance, provides StratifiedGroupKFold). For nearly every clinical AI problem, this is the right default:

Split by patient (no patient appears in two folds).
Within those splits, keep the class balance roughly equal.

If your tooling doesn’t support this directly, it’s easy enough to do by hand: stratify patients into class buckets first, then assign patients to folds maintaining balance. The result is a set of folds that respect both the structure (one patient = one fold) and the class distribution.
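The by-hand procedure can be sketched in plain Python. The function name stratified_group_folds is hypothetical, and the sketch assumes a patient counts as positive if any of their cases is positive; other patient-level labeling rules are equally defensible.

```python
import random
from collections import defaultdict


def stratified_group_folds(patient_ids, labels, k, seed=0):
    """Assign every case to a fold so that patients stay whole and classes stay balanced."""
    # Assumption: a patient is positive if any of their cases is positive.
    patient_label = defaultdict(int)
    for pid, label in zip(patient_ids, labels):
        patient_label[pid] = max(patient_label[pid], label)

    # Bucket patients by their patient-level class.
    buckets = defaultdict(list)
    for pid, label in patient_label.items():
        buckets[label].append(pid)

    rng = random.Random(seed)
    fold_of_patient = {}
    offset = 0
    for label in sorted(buckets):
        patients = sorted(buckets[label])
        rng.shuffle(patients)
        # Deal patients round-robin; the running offset keeps fold sizes even across buckets.
        for i, pid in enumerate(patients):
            fold_of_patient[pid] = (i + offset) % k
        offset += len(patients)

    # Every case inherits the fold of its patient.
    return [fold_of_patient[pid] for pid in patient_ids]


# Hypothetical data: 50 patients, 4 cases each, every fifth patient positive.
patient_ids = [p for p in range(50) for _ in range(4)]
labels = [1 if p % 5 == 0 else 0 for p in patient_ids]
case_folds = stratified_group_folds(patient_ids, labels, k=5)
```

Balancing at the patient level only approximates balance at the case level when patients contribute very different numbers of cases, so it is worth checking the per-fold positive rate after the assignment.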

Leave-One-Site-Out: The Gold Standard for Multi-Site Studies

If your data comes from multiple hospitals or imaging sites, the cross-validation question becomes more interesting. You want to know whether the model generalizes across sites, which is a stronger requirement than generalizing across patients.

In leave-one-site-out cross-validation, each fold corresponds to all the data from one site. You train on all sites except one and test on the held-out site. This is a much harder test, because the model can’t lean on site-specific quirks during training.

A model that performs at AUC = 0.92 in grouped K-fold and AUC = 0.74 in leave-one-site-out has learned a lot about site-specific patterns. That’s not necessarily a disaster; it’s information. But if your intended use is “this model should work in any participating hospital,” only the leave-one-site-out result is actually relevant.

This is why the FDA loves multi-site external validation and why submissions that rely on internal cross-validation often draw deficiency letters. Leave-one-site-out is the closest thing to external validation that you can do within a single dataset.
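Leave-one-site-out is grouped cross-validation with the site as the group. A minimal sketch using scikit-learn's LeaveOneGroupOut follows; the four site labels and the placeholder features are hypothetical.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical site labels for 12 cases drawn from 4 hospitals.
sites = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"])
X = np.zeros((len(sites), 1))  # placeholder features

logo = LeaveOneGroupOut()

held_out_sites = []
for train_idx, test_idx in logo.split(X, groups=sites):
    train_sites = np.unique(sites[train_idx]).tolist()
    test_sites = np.unique(sites[test_idx]).tolist()
    # The held-out site must not contribute any training data.
    assert set(train_sites).isdisjoint(test_sites)
    held_out_sites.append(test_sites)
```

The number of folds equals the number of sites, so with only a handful of hospitals the per-fold metrics are worth reporting individually rather than only as an average.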

Temporal Cross-Validation: For Anything That Drifts Over Time

A model trained on 2018 imaging data will face 2025 imaging data when deployed. Scanner models have changed, protocols have changed, patient populations have shifted. Random K-fold over a multi-year dataset ignores all of this.

Temporal cross-validation respects time. Train on years 2018-2021, test on 2022. Or train on the first 80% of dates and test on the most recent 20%. The test set is always later than the training set.
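A cutoff-date split is the simplest version of this. The sketch below is plain Python; the function name temporal_split, the one-study-per-month dataset, and the 2022 cutoff are assumptions chosen to mirror the "train on 2018-2021, test on 2022" example.

```python
from datetime import date


def temporal_split(dates, cutoff):
    """Return (train_idx, test_idx): train strictly before the cutoff, test on or after it."""
    train_idx = [i for i, d in enumerate(dates) if d < cutoff]
    test_idx = [i for i, d in enumerate(dates) if d >= cutoff]
    return train_idx, test_idx


# Hypothetical acquisition dates: one study per month from 2018 through 2022.
dates = [date(year, month, 15) for year in range(2018, 2023) for month in range(1, 13)]

train_idx, test_idx = temporal_split(dates, cutoff=date(2022, 1, 1))
latest_train = max(dates[i] for i in train_idx)
earliest_test = min(dates[i] for i in test_idx)
```

Splitting on a date rather than on a row count avoids the case where studies acquired on the same day fall on both sides of the boundary.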

This is the right approach when:

The relationship between inputs and outputs might drift over time (concept drift, in the language of 6.12).
The clinical practice that generates your labels might have evolved (different documentation standards, new diagnostic criteria, etc.).
You want to estimate how the model will perform on future data, which is what deployment will actually involve.

For chronic-disease prediction, sepsis prediction, EHR-based models, and many imaging models, temporal cross-validation reveals fragility that random cross-validation will not. A model that drops from AUC 0.88 (random K-fold) to 0.71 (temporal split) has a real generalization problem you need to understand before deployment.

Nested Cross-Validation: For Hyperparameter Tuning

This is where it gets a bit hairy, but the consequence of skipping it is one of the biggest causes of optimism in published AI work.

If you tune hyperparameters using cross-validation performance, you have effectively “trained” on the cross-validation results. The model’s choice of hyperparameters was made by looking at the data that you’re now reporting performance on. The reported cross-validation accuracy is biased upward.

Nested cross-validation fixes this:

An outer loop estimates final performance.
An inner loop within each outer fold does hyperparameter tuning.
Hyperparameters chosen on the inner folds are then evaluated on the outer fold the model hasn’t seen.

This is more computational work (roughly K × K model fits per hyperparameter candidate, instead of K), but it’s the only way to get an honest cross-validated estimate of model performance when hyperparameter tuning is part of the process.
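One common way to express the two loops in scikit-learn is to wrap a GridSearchCV (the inner loop) inside cross_val_score (the outer loop). The sketch below uses synthetic data, logistic regression, and a small grid over C purely as stand-ins; for brevity it stratifies but does not group by patient, which a real clinical study would also need to do.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in data; a real study would also group the splits by patient.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: choose the regularization strength C using only outer-training data.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: each outer fold reruns the whole search, then scores on data it never saw.
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
mean_auc = float(outer_auc.mean())
std_auc = float(outer_auc.std())
```

Because the search object is refit inside every outer fold, different outer folds may settle on different values of C. That is expected: the estimate describes the whole tuning procedure, not one fixed configuration.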

In practice, many papers skip nested CV and either (a) hold out an entirely separate test set for final evaluation (which is the simpler and usually-preferable alternative), or (b) just don’t worry about it and report optimistically biased numbers. Reviewers should ask which one is happening.

Cross-Validation vs. Held-Out Test Set: Which When?

Cross-validation is useful for model development: tuning hyperparameters, comparing architectures, estimating performance variability. It is not a substitute for a held-out test set in a serious clinical AI study.

The pattern that works:

Split off a true test set early and don’t touch it. Ideally, this comes from a different time period, a different site, or both.
Use cross-validation on the remaining data for development decisions: hyperparameter tuning, model selection, ablations.
Compute final reported performance on the held-out test set, once. Not twice. Not “we made one small tweak and re-ran.”
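The first step of this pattern can be sketched as a single pass over the case list. The record fields (case_id, site, year), the function name split_off_test_set, and the choice of site D and the year 2023 as the reserved slice are all hypothetical.

```python
def split_off_test_set(cases, test_site, test_from_year):
    """Reserve cases from one site, or from a late period, as the final test set."""
    development, test = [], []
    for case in cases:
        is_test = case["site"] == test_site or case["year"] >= test_from_year
        (test if is_test else development).append(case)
    return development, test


# Hypothetical records: one case per site and year, 4 sites, 2018 through 2023.
cases = [
    {"case_id": f"{site}-{year}", "site": site, "year": year}
    for site in "ABCD"
    for year in range(2018, 2024)
]

development, test = split_off_test_set(cases, test_site="D", test_from_year=2023)
# All cross-validation and tuning should use `development` only.
```

Writing the test-set case IDs to a file at this point, and loading them only for the final evaluation, is one way to make the "don't touch it" rule harder to break by accident.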

This separation is what GMLP Principle 4 is about, and it’s also the foundation of 3.2 Training, Validation, and Test Sets. Cross-validation belongs in the development phase. The test set is your one-shot honest measurement.

A Worked Example: The Same Data, Five Different Stories

Imagine 10,000 chest X-rays from 1,500 patients across 4 hospitals, collected from 2018-2023, with a 12% positive rate for a particular finding.

Here are the five cross-validation schemes you could apply, and what each one tells you:

1. Naive K-fold: 10,000 images shuffled and split. AUC = 0.93. This is contaminated by patient-level and possibly site-level leakage. Treat as upper bound.
2. Stratified K-fold: Same as above, with class balance preserved. AUC = 0.93 with slightly tighter CI. Still contaminated.
3. Grouped K-fold (by patient): AUC = 0.87. This is closer to honest. The 6-point drop is the size of the patient-identity leakage.
4. Leave-one-site-out: AUC = 0.79. The 8-point further drop is the size of the site-specific learning. If you deploy to a new hospital, this is a more realistic expectation.
5. Temporal split (train 2018-2021, test 2022-2023): AUC = 0.81. Slightly better than leave-one-site-out, but reveals some drift over time. If you deploy in 2026, the relevant number is probably between this and the site-out estimate.

A paper reporting only number 1 looks like a 0.93 model. A paper reporting number 5 looks like a 0.81 model. Same data, same model, a 12-point difference in apparent AUC. The cross-validation scheme is doing as much work as the model.
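The gaps quoted in the worked example can be recomputed directly from the five AUC values. The dictionary keys and the helper name points are made up for this sketch; the AUC values are the illustrative ones given above.

```python
# AUC values from the worked example, keyed by cross-validation scheme.
auc = {
    "naive": 0.93,
    "stratified": 0.93,
    "grouped": 0.87,
    "site_out": 0.79,
    "temporal": 0.81,
}


def points(higher, lower):
    """Difference between two AUCs, expressed in whole AUC points (hundredths)."""
    return round((higher - lower) * 100)


patient_leak = points(auc["naive"], auc["grouped"])       # 6
site_gap = points(auc["grouped"], auc["site_out"])        # 8
naive_vs_temporal = points(auc["naive"], auc["temporal"])  # 12
full_spread = points(max(auc.values()), min(auc.values()))  # 14
```

The spread between the most optimistic and the most pessimistic scheme is 14 points, all of it attributable to how the folds were built.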

When you read an AI paper, what cross-validation scheme they used is at least as important as the AUC they reported.

What Regulators Look For

For a 510(k) or De Novo submission, the test set has to be genuinely held out, and the FDA’s expectation is closer to leave-one-site-out or temporal validation than to internal random splits. Most modern submissions for imaging AI use:

Internal cross-validation (grouped by patient) during development.
An entirely separate, multi-site, prospectively collected (or at least retrospectively curated from sites not seen during development) test set for primary performance claims.

GMLP Principle 3 (representative datasets) and Principle 4 (independent test sets) both point in this direction. A submission that uses naive random K-fold as the primary evidence will be questioned hard.

Common Mistakes Worth Calling Out

A few that we see in submitted papers and in submitted regulatory packages, in rough order of frequency:

Image-level rather than patient-level splits. The single most common mistake. Almost any imaging AI paper that doesn’t explicitly mention patient-level splitting (grouped K-fold or an equivalent) has probably gotten this wrong.

Hyperparameter tuning on the test set. Sometimes accidental (the team “checked” the test set during development), sometimes deliberate (“we picked the best of three model versions on the test set”). Either way, the reported performance is optimistic.

Re-running with different random seeds and reporting the best. This is cherry-picking from seed-to-seed variance. The right thing to do is report mean and standard deviation across seeds, or use a single pre-specified seed.
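Reporting across seeds takes only a few extra lines. The sketch below uses synthetic data and logistic regression as stand-ins, with a seed list fixed in advance; here the seed controls only the fold shuffling, and in a real pipeline it would typically also control model initialization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

SEEDS = [0, 1, 2, 3, 4]  # fixed in advance, before looking at any results

seed_auc = []
for seed in SEEDS:
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(
        LogisticRegression(max_iter=1000), X, y, scoring="roc_auc", cv=cv
    )
    seed_auc.append(float(scores.mean()))

mean_auc = float(np.mean(seed_auc))
std_auc = float(np.std(seed_auc, ddof=1))  # sample standard deviation across seeds
best_auc = max(seed_auc)                   # the number that should NOT be reported alone
report = f"AUC {mean_auc:.3f} +/- {std_auc:.3f} across {len(SEEDS)} seeds"
```

The best single seed can never be below the mean, which is exactly why quoting it alone overstates performance.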

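Mixed-site folds when sites differ systematically. If site A is a pediatric center and site B is a geriatric one, mixing them across folds lets the model exploit site-specific cues that are present in both training and test folds. The resulting splits are artificially easy, and the performance estimate won’t carry over to a new site.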

Tiny inner folds in nested CV. If your inner cross-validation has 30 samples per fold, the hyperparameter selection is noisy and the outer estimate becomes hard to trust. Increasing the data size helps; if that’s impossible, simplify the model.

Key Takeaways

Cross-validation by patient (grouped K-fold) is the right default for any dataset with multiple measurements per patient. Naive random K-fold leaks information.
Leave-one-site-out cross-validation is the closest thing to external validation you can do without a separate dataset.
Temporal cross-validation matters whenever the data-generating process changes over time, which for medical data is almost always.
Nested cross-validation is the honest way to combine hyperparameter tuning with cross-validated performance estimates. A held-out test set is the simpler alternative.
Cross-validation belongs in development, not as your primary evidence. A held-out test set is your one-shot honest measurement, and the FDA expects it.
Different cross-validation strategies on the same data can produce wildly different apparent performance. The scheme matters as much as the result.