Evaluating Model Performance
Technical metrics for assessing how well an AI model performs its intended task. This section gives clinical researchers the vocabulary to critically evaluate published results and their own models.
4.1 Classification Metrics: Sensitivity, Specificity, and Beyond
Accuracy, sensitivity (recall), specificity, positive predictive value (PPV), negative predictive value (NPV), and why accuracy alone is often misleading in medical applications, where the condition of interest is usually rare.
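As a rough illustration of how these metrics relate, the sketch below computes them from confusion-matrix counts. The function name and the cohort numbers are hypothetical, chosen only to show how a model can reach high accuracy while missing most true cases.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero; a production version would
    need to handle empty classes explicitly.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # recall: fraction of diseased cases detected
        "specificity": tn / (tn + fp),   # fraction of healthy cases correctly cleared
        "ppv": tp / (tp + fp),           # probability a positive call is correct
        "npv": tn / (tn + fn),           # probability a negative call is correct
    }


# Hypothetical screening cohort: 1000 patients, 10 of whom have the disease.
m = classification_metrics(tp=2, fp=5, tn=985, fn=8)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this made-up cohort the accuracy is 0.987 even though sensitivity is only 0.2, which is the kind of gap that accuracy alone conceals when the condition is rare.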
4.2 The ROC Curve and AUC: What They Tell You and What They Hide
How to read and interpret receiver operating characteristic (ROC) curves, what AUC actually measures, and its important limitations, including why a high AUC does not guarantee clinical utility.
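One way to see what AUC measures is its ranking interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below is a minimal, quadratic-time version of that idea; the function name and the toy scores are hypothetical.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes labels are 0/1 and that
    both classes are present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_ranking(labels, scores)

# Shrinking every score by the same factor keeps the ordering, so AUC is unchanged.
auc_rescaled = auc_by_ranking(labels, [s / 10 for s in scores])
print(auc, auc_rescaled)
```

Because only the ordering of scores matters, AUC says nothing about whether the scores are usable as probabilities or about performance at the specific threshold a clinic would adopt.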
4.3 Segmentation Metrics: Dice, IoU, and Volumetric Measures
Evaluating models that outline or delineate structures (tumors, fluid, anatomical regions). Dice similarity coefficient, intersection over union, Hausdorff distance, and when each is appropriate.
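The two overlap metrics can be sketched on flattened binary masks as follows. The function name and the toy masks are hypothetical, and returning 1.0 when both masks are empty is one common convention rather than a universal rule.

```python
def dice_and_iou(pred, truth):
    """Dice coefficient and intersection over union for two binary masks.

    Masks are equal-length sequences of 0/1 values (for example, a
    flattened image). Returns (dice, iou).
    """
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_size = sum(1 for p in pred if p)
    truth_size = sum(1 for t in truth if t)
    union = pred_size + truth_size - intersection
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_size + truth_size)
    iou = intersection / union
    return dice, iou


pred = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
dice, iou = dice_and_iou(pred, truth)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

The two are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank models identically on a single case. Neither captures boundary errors the way a distance-based measure such as Hausdorff distance does.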
4.4 Calibration: Does a 90% Confidence Score Really Mean 90%?
Why model confidence scores are often poorly calibrated, why this matters for clinical decision-making, and how to measure and improve calibration.
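One common way to quantify miscalibration is the expected calibration error (ECE), which bins predictions by confidence and compares average confidence with observed accuracy in each bin. The sketch below assumes equal-width bins; the function name and the toy data are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct: 1 if the corresponding prediction was right, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(conf for conf, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of the samples.
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical model that reports 90% confidence but is right only 6 times in 10.
confidences = [0.9] * 10
correct = [1] * 6 + [0] * 4
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.2f}")
```

Here the gap between stated confidence (0.9) and observed accuracy (0.6) gives an ECE of about 0.3. The value depends on the binning scheme, so the bin count should be reported alongside it.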
4.5 Statistical Significance, Confidence Intervals, and Sample Size for AI Studies
Applying rigorous statistical methodology to AI evaluation. Confidence intervals for AUC, bootstrapping, multiple comparisons, and sample size calculations for performance studies.
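A percentile bootstrap is one simple way to attach a confidence interval to a performance metric. The sketch below applies it to sensitivity on a hypothetical test set; the function names, the number of resamples, and the data are all illustrative assumptions, and the same resampling loop could be used with an AUC function instead.

```python
import random


def sensitivity(labels, preds):
    """Fraction of true positives among all positive cases."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp / (tp + fn)


def bootstrap_ci(labels, preds, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric.

    Resamples cases with replacement; resamples on which the metric is
    undefined (for example, no positive cases) are skipped.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(metric([labels[i] for i in idx],
                                [preds[i] for i in idx]))
        except ZeroDivisionError:
            continue
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Hypothetical test set: 20 diseased cases (16 detected) and 80 healthy cases.
labels = [1] * 20 + [0] * 80
preds = [1] * 16 + [0] * 4 + [0] * 80
point = sensitivity(labels, preds)
lower, upper = bootstrap_ci(labels, preds, sensitivity)
print(f"sensitivity = {point:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

With only 20 positive cases the interval is wide, which is the practical link between confidence intervals and sample size. When a patient contributes several images, resampling should be done at the patient level rather than per image.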
4.6 Cross-Validation Strategies for Medical Data
K-fold, stratified, grouped (by patient, by site), and leave-one-site-out cross-validation. Choosing the right strategy to avoid inflated performance estimates from data leakage.
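The idea behind grouped cross-validation can be sketched in a few lines: assign folds to patients, not to individual samples, so that no patient appears in both training and validation data. The function name and patient IDs below are hypothetical; in practice a library implementation such as scikit-learn's `GroupKFold` would typically be used instead.

```python
def grouped_folds(patient_ids, n_folds):
    """Split sample indices into folds so each patient stays in one fold.

    patient_ids: one ID per sample (a patient may have several samples).
    Returns a list of n_folds lists of sample indices.
    """
    unique_patients = sorted(set(patient_ids))
    # Round-robin assignment of patients (not samples) to folds.
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique_patients)}
    folds = [[] for _ in range(n_folds)]
    for sample_idx, pid in enumerate(patient_ids):
        folds[fold_of[pid]].append(sample_idx)
    return folds


# Hypothetical dataset: seven scans from four patients.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4"]
folds = grouped_folds(patient_ids, n_folds=2)
for k, fold in enumerate(folds):
    print(k, [patient_ids[i] for i in fold])
```

Replacing patient IDs with site IDs and setting the number of folds to the number of sites gives leave-one-site-out cross-validation under the same scheme.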
4.7 Comparing Models: When Is One Model Truly Better Than Another?
Statistical tests for comparing classifier performance (McNemar's test, DeLong's test for AUC comparison), and avoiding the pitfall of selecting models based on noise in the test set.
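McNemar's test compares two classifiers evaluated on the same cases by looking only at the cases where they disagree. The sketch below implements the exact (binomial) form of the test; the function name and the disagreement counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases model A classified correctly and model B incorrectly.
    c: cases model B classified correctly and model A incorrectly.
    Under the null hypothesis, each discordant case is equally likely
    to favor either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical comparison: the models disagree on 20 test cases,
# with model A correct on 15 of them and model B correct on 5.
p_value = mcnemar_exact(b=15, c=5)
print(f"p = {p_value:.4f}")
```

Cases that both models get right or both get wrong do not enter the calculation, which is why two models with similar overall accuracy can still differ significantly, or fail to, depending on how their errors overlap.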