Segmentation Metrics: Dice, IoU, and Volumetric Measures

Evaluating models that outline or delineate structures (tumors, fluid, anatomical regions). Dice similarity coefficient, intersection over union, Hausdorff distance, and when each is appropriate.

What Segmentation Is, and Why It Needs Its Own Metrics

A classification model assigns a label to an image: pneumonia or not. A segmentation model assigns a label to every pixel: these pixels are tumor, these pixels over here are liver parenchyma, those pixels are background. The output is called a mask: an array the same size as the input image in which each pixel has been classified.

You could in principle evaluate this with classification metrics applied pixel by pixel. People did try to do that, and the result wasn’t great. Pixel-level accuracy for a small tumor in a large image is dominated by background pixels: if the tumor occupies 0.5% of a chest CT, a model that predicts “background” for every pixel is 99.5% accurate by pixel count while finding no tumor at all. It’s the same class imbalance trap from 4.1, now operating at the pixel level.
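The accuracy trap is easy to reproduce. The sketch below uses a made-up 200 x 200 image with a 200-pixel tumor (the sizes are assumptions chosen to give a 0.5% foreground) and represents each mask as a set of (row, column) coordinates; real pipelines would typically use arrays instead.

```python
# Hypothetical 200 x 200 slice containing a 10 x 20 tumor (0.5% of the pixels).
HEIGHT, WIDTH = 200, 200
truth = {(r, c) for r in range(95, 105) for c in range(90, 110)}

# A useless model that labels every pixel as background (no tumor pixels at all).
prediction = set()

total_pixels = HEIGHT * WIDTH
# Pixels where the two masks disagree are exactly the symmetric difference.
disagreements = len(truth ^ prediction)
accuracy = (total_pixels - disagreements) / total_pixels

# Sensitivity: fraction of true tumor pixels the model actually found.
sensitivity = len(truth & prediction) / len(truth)

print(f"pixel accuracy: {accuracy:.3f}")     # 0.995
print(f"sensitivity:    {sensitivity:.3f}")  # 0.000
```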

So segmentation gets its own family of metrics. They’re all variations on the same theme: how much does the predicted mask overlap with the ground truth mask?

Dice Similarity Coefficient: The Default

The Dice coefficient (also called the Dice-Sørensen coefficient, after the American ecologist Lee Dice and the Danish botanist Thorvald Sørensen, who developed it independently in 1945 and 1948 for comparing ecological communities) is the standard segmentation metric you’ll see in essentially every paper.

Dice = 2 × |A ∩ B| / (|A| + |B|)

In more words: take twice the area of overlap between the AI prediction (A) and the ground truth (B), and divide it by the sum of the two areas. (The factor of 2 is just to make it come out between 0 and 1 rather than 0 and 0.5.)

A Dice score of 1.0 means perfect overlap. A score of 0 means no overlap whatsoever. Real models typically land in the 0.7-0.95 range for tractable problems, lower for harder ones (small lesions, ambiguous boundaries, low-contrast structures).
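As a minimal sketch, the formula maps directly onto set operations when each mask is stored as a set of pixel coordinates. The `rectangle` helper and the example masks are hypothetical, and returning 1.0 when both masks are empty is an assumed convention (the formula itself gives 0/0 in that case).

```python
def dice(a, b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for masks given as sets of pixel coordinates."""
    if not a and not b:
        # Both masks empty: the formula is 0/0. Treating this as perfect
        # agreement is an assumed convention, not part of the definition.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def rectangle(r0, r1, c0, c1):
    """Hypothetical helper: all pixels with r0 <= row < r1 and c0 <= col < c1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


truth = rectangle(0, 10, 0, 10)       # 100-pixel ground truth
shifted = rectangle(0, 10, 1, 11)     # same size, shifted one column (90 overlap)
elsewhere = rectangle(20, 30, 20, 30) # no overlap at all

print(dice(truth, truth))      # 1.0
print(dice(truth, shifted))    # 0.9
print(dice(truth, elsewhere))  # 0.0
```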

Dice has a few nice properties: it’s symmetric (it doesn’t matter which mask you call A and which you call B), it’s bounded, and it penalizes both false positives (predicted tumor where there isn’t one) and false negatives (missed tumor) reasonably symmetrically. It also has some blind spots, which are covered under “What Dice Doesn’t Tell You.”

Intersection over Union (IoU): The same but different

The IoU is also called the Jaccard index, after another botanist (this time Swiss) who came up with it in 1901.

IoU = |A ∩ B| / |A ∪ B|

Again we take the overlap between the prediction and the ground truth, but this time divide by the union of the two. IoU is always lower than Dice for any partial overlap (the two agree only at 0 and 1), and the two are related by a simple formula:

IoU = Dice / (2 - Dice)

So a Dice of 0.9 corresponds to an IoU of about 0.82. They are functionally interchangeable: because one is a monotonic function of the other, any single case is ranked identically by both (averages over many cases can occasionally reorder models that are very close). Computer vision papers tend to prefer IoU; medical imaging papers tend to prefer Dice. We mostly bring it up so you can read either kind of paper without confusion.
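The conversion can be checked numerically. The following sketch computes both metrics from the same pair of hypothetical masks (sets of pixel coordinates) and confirms that converting one into the other gives the same number; the inverse formula Dice = 2 × IoU / (1 + IoU) follows from rearranging the one above.

```python
def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


def iou(a, b):
    """IoU = |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def dice_to_iou(d):
    return d / (2 - d)


def iou_to_dice(j):
    # Obtained by solving IoU = Dice / (2 - Dice) for Dice.
    return 2 * j / (1 + j)


# Hypothetical masks: a 10 x 10 square and the same square shifted one column.
truth = {(r, c) for r in range(10) for c in range(10)}
pred = {(r, c) for r in range(10) for c in range(1, 11)}

d = dice(truth, pred)  # 2 * 90 / 200
j = iou(truth, pred)   # 90 / 110

print(f"Dice = {d:.3f}, IoU = {j:.3f}")                   # Dice = 0.900, IoU = 0.818
print(f"IoU from Dice = {dice_to_iou(d):.3f}")            # 0.818
print(f"Dice from IoU = {iou_to_dice(j):.3f}")            # 0.900
```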

What Dice Doesn’t Tell You

Just like AUC, Dice is a single-number summary, and single-number summaries hide things.

Issue 1: Dice Is Size Dependent

If you’re segmenting a 5-pixel-wide blood vessel and the model is off by 1 pixel everywhere, you can score a Dice as low as 0.5 even though clinically you’ve done a perfectly fine job. Conversely, if you’re segmenting a large organ and you miss a 2mm rim everywhere, your Dice can still be around 0.95 even though you’ve systematically under-measured the organ.

The size of the structure matters enormously to how stringent Dice actually is. A Dice of 0.85 on a 50-pixel lesion is excellent. A Dice of 0.85 on a 5,000-pixel liver is mediocre at best. This is one reason that comparing Dice scores across different segmentation tasks (or across different lesion sizes within one task) is problematic.
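To make the size effect concrete, here is a small sketch that applies the same error (a prediction that is one pixel too small all the way around) to discs of different sizes. The disc shape and the radii are assumptions chosen for illustration.

```python
def disk(radius):
    """All integer pixels within `radius` of the origin."""
    reach = int(radius) + 1
    return {(y, x)
            for y in range(-reach, reach + 1)
            for x in range(-reach, reach + 1)
            if x * x + y * y <= radius * radius}


def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


def dice_with_one_pixel_rim_missed(radius):
    """Truth is a disc of `radius`; the prediction is the same disc shrunk by 1 pixel."""
    return dice(disk(radius), disk(radius - 1))


scores = {radius: dice_with_one_pixel_rim_missed(radius) for radius in (3, 10, 50)}
for radius, score in scores.items():
    # The boundary error is identical in every row; only the object size changes.
    print(f"radius {radius:>2} px: Dice = {score:.3f}")
```

The same one-pixel miss costs the smallest disc far more Dice than the largest one, which is why a raw Dice value means little without knowing the size of the structure.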

Issue 2: Dice Doesn’t Care About Shape

A 1-pixel-wide error sprinkled randomly throughout the boundary scores about the same on Dice as a 1-pixel-wide error concentrated on one edge. But clinically, those errors can be very different. If a tumor segmentation has a 2mm “wobble” on a clinically irrelevant edge (say, the side facing healthy tissue with no plans for resection), that’s fine, while the same 2mm error along the boundary with a critical adjacent structure (a major blood vessel, the spinal cord) might be a real problem. Dice cannot distinguish these.
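The reason Dice cannot tell these cases apart is that it only counts pixels. In the sketch below (hypothetical 20 x 20 truth mask), one prediction misses 20 pixels along a single edge and another misses 20 pixels scattered around all four edges; the scores are identical.

```python
def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


truth = {(r, c) for r in range(20) for c in range(20)}  # 400 pixels

# Error concentrated on one edge: the whole left column is missed.
missed_one_edge = {(r, 0) for r in range(20)}

# Same number of missed pixels, spread over all four edges (corners avoided
# so that no pixel is counted twice).
spots = (2, 6, 10, 14, 18)
missed_scattered = (
    {(0, c) for c in spots} | {(19, c) for c in spots}
    | {(r, 0) for r in spots} | {(r, 19) for r in spots}
)

pred_one_edge = truth - missed_one_edge
pred_scattered = truth - missed_scattered

print(len(missed_one_edge), len(missed_scattered))  # 20 20
print(f"{dice(truth, pred_one_edge):.3f}")          # 0.974
print(f"{dice(truth, pred_scattered):.3f}")         # 0.974
```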

Issue 3: Dice Is Optimistic at the Extremes

This is pretty much the same as issue 1 but with a different emphasis. For very small objects (a few pixels), Dice becomes noisy and unreliable since if the model and the annotator disagree by 1 or 2 pixels the Dice swings wildly. For very large objects (an organ filling most of the image), Dice converges toward 1.0 even for models that miss meaningful boundaries or chunks of the organ.

Dice is most informative for medium-sized structures with well-defined boundaries: tumors, lesions, organs of intermediate size. Outside that sweet spot, it’s good to give a little less weight to this metric.

Hausdorff Distance: How Far Off Are the Boundaries?

One metric that tries to address areas where Dice is weak is the Hausdorff distance, which compares the boundaries of the two segmentations instead of their full areas. To compute it, pick a pixel on the boundary of the predicted mask, find the closest pixel on the boundary of the ground truth mask, and measure the distance between them. Repeat this for every pixel on the predicted boundary, and keep the largest value.

Then do the same starting from the ground truth boundary. The larger of these two numbers is the Hausdorff distance: the farthest you have to travel from any point on one boundary to reach the other boundary. (This distance can be calculated for any two sets of points; they don’t have to be boundaries.)

What this captures that Dice does not is the worst boundary error. A 2mm error along most of the boundary plus one spot where the model is off by 20mm will produce a Dice score that looks fine, and a Hausdorff distance of 20mm that tells a different story.
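The procedure can be sketched as a brute-force search over boundary pixels. Several things here are assumptions for illustration: a boundary pixel is taken to be any mask pixel with a 4-connected neighbour outside the mask, distances are in pixel units (reporting in mm would require the pixel spacing), and the example prediction is a made-up square with a thin spur sticking out of one side. Production code would normally use a distance transform or a library routine instead of this quadratic loop.

```python
import math


def boundary(mask):
    """Mask pixels that have at least one 4-connected neighbour outside the mask."""
    return {(r, c) for (r, c) in mask
            if any(n not in mask
                   for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)))}


def directed_hausdorff(src, dst):
    """Largest distance from a point in src to its closest point in dst."""
    return max(min(math.dist(p, q) for q in dst) for p in src)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


truth = {(r, c) for r in range(10) for c in range(10)}       # 10 x 10 square
spur = {(r, c) for r in range(4, 6) for c in range(10, 20)}  # 2 x 10 spur on the right
pred = truth | spur

print(f"Dice      = {dice(truth, pred):.3f}")                         # 0.909
print(f"Hausdorff = {hausdorff(boundary(truth), boundary(pred))}")    # 10.0
```

The overlap score stays above 0.9 because the spur adds only 20 pixels, while the Hausdorff distance reports the full 10-pixel excursion.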

For surgical planning, radiotherapy contouring, or any application where boundary accuracy in a specific location matters more than aggregate overlap, Hausdorff is much more informative than Dice. Most thoughtful papers now report both. The recommendation in GMLP Principle 8 about “clinically meaningful metrics” tilts in favor of including Hausdorff for boundary-sensitive applications, because radiologists and surgeons think in terms of “how far off?” rather than “what fraction overlap?”

Sometimes papers report the 95th percentile Hausdorff distance (HD95) instead of the maximum: the boundary distance below which 95% of points fall. This is more robust to a single outlier pixel and is increasingly the default in modern segmentation papers.
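A rough sketch of the difference between the maximum and the 95th percentile follows. Two conventions are assumed here and vary between implementations: the percentile is taken separately in each direction and the larger value is kept, and the percentile itself uses the nearest-rank method (many libraries interpolate instead). The example prediction is a hypothetical perfect mask plus one stray pixel far away.

```python
import math


def boundary(mask):
    """Mask pixels that have at least one 4-connected neighbour outside the mask."""
    return {(r, c) for (r, c) in mask
            if any(n not in mask
                   for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)))}


def nearest_distances(src, dst):
    """Distance from every point in src to its closest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (assumed convention); pct is an integer 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def hausdorff_percentile(a, b, pct):
    """pct=100 gives the classic Hausdorff distance; pct=95 gives HD95."""
    return max(percentile_nearest_rank(nearest_distances(a, b), pct),
               percentile_nearest_rank(nearest_distances(b, a), pct))


truth = {(r, c) for r in range(10) for c in range(10)}
pred = truth | {(5, 30)}  # perfect mask plus one stray pixel 21 columns away

b_truth, b_pred = boundary(truth), boundary(pred)
print(hausdorff_percentile(b_pred, b_truth, 100))  # 21.0
print(hausdorff_percentile(b_pred, b_truth, 95))   # 0.0
```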

Volumetric Similarity and Volume Error

For 3D structures (tumors in CT, lesions in MRI, organs), the thing clinicians are often most interested in is the total volume of the predicted segmentation. A radiation oncologist planning treatment for a brain tumor doesn’t really care about pixel-level overlap. They care about how big the tumor is and where its center of mass is.

If a problem falls into this category, there are a few useful volumetric metrics:

Relative volume error: (predicted volume - true volume) / true volume. This number is signed: positive means the model over-segments, negative means it under-segments. A consistently negative value across cases indicates that the model systematically under-estimates volume.

Volumetric similarity: 1 - |predicted volume - true volume| / (predicted volume + true volume). This is similar to Dice but it is volume-only, ignoring spatial overlap entirely. Two completely non-overlapping masks of the same volume will score perfectly. That’s a feature, not a bug, if what you care about is volume estimation rather than spatial localization.
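Both volumetric metrics are a few lines each once the masks are reduced to volumes. In this sketch the voxel spacing (1.0 x 1.0 x 2.5 mm), the `box` helper, and the three masks are all hypothetical; neither function guards against a zero true volume.

```python
def relative_volume_error(pred_volume, true_volume):
    """Signed: positive = over-segmentation, negative = under-segmentation."""
    return (pred_volume - true_volume) / true_volume


def volumetric_similarity(pred_volume, true_volume):
    """1 - |Vp - Vt| / (Vp + Vt); ignores where the volumes are located."""
    return 1 - abs(pred_volume - true_volume) / (pred_volume + true_volume)


def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


def box(z0, z1, y0, y1, x0, x1):
    """Hypothetical helper: a set of (z, y, x) voxel coordinates."""
    return {(z, y, x)
            for z in range(z0, z1) for y in range(y0, y1) for x in range(x0, x1)}


VOXEL_VOLUME_MM3 = 1.0 * 1.0 * 2.5  # assumed spacing, in mm

truth = box(0, 4, 0, 10, 0, 10)        # 400 voxels = 1000 mm^3
too_small = box(0, 3, 0, 10, 0, 10)    # 300 voxels, fully inside the truth
misplaced = box(0, 4, 20, 30, 20, 30)  # 400 voxels, no overlap with the truth

v_truth = len(truth) * VOXEL_VOLUME_MM3
for name, mask in (("too_small", too_small), ("misplaced", misplaced)):
    v_pred = len(mask) * VOXEL_VOLUME_MM3
    print(name,
          f"RVE = {relative_volume_error(v_pred, v_truth):+.2f}",
          f"VS = {volumetric_similarity(v_pred, v_truth):.3f}",
          f"Dice = {dice(mask, truth):.3f}")
# too_small RVE = -0.25 VS = 0.857 Dice = 0.857
# misplaced RVE = +0.00 VS = 1.000 Dice = 0.000
```

The misplaced mask is the case described above: perfect volumetric similarity with zero spatial overlap.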

Per-Class and Per-Structure Reporting

Most clinical segmentation problems involve multiple classes: background, organ A, organ B, tumor within organ A, etc. The same multi-class considerations from 4.1 apply here.

A single “mean Dice across all classes” can hide that the model is excellent at the easy classes (kidneys, liver, spleen) and terrible at the hard ones (small adrenal masses, lymph nodes). A good report will give Dice for each class separately rather than only an average across classes.
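Here is a minimal sketch of per-class reporting on a toy label map. The class IDs, class names, and the tiny 4 x 6 maps are invented for illustration, and background (label 0) is left out of the average, which is a common but not universal choice.

```python
def class_mask(label_map, label):
    """Set of (row, col) positions where the label map equals `label`."""
    return {(r, c)
            for r, row in enumerate(label_map)
            for c, value in enumerate(row)
            if value == label}


def per_class_dice(pred_map, truth_map, class_names):
    scores = {}
    for label, name in class_names.items():
        p, t = class_mask(pred_map, label), class_mask(truth_map, label)
        total = len(p) + len(t)
        # A class absent from both maps has no defined Dice; NaN flags that.
        scores[name] = 2 * len(p & t) / total if total else float("nan")
    return scores


CLASS_NAMES = {1: "liver", 2: "lesion"}  # hypothetical labels; 0 is background

truth_map = [
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 2, 1, 0],
    [0, 1, 1, 2, 1, 0],
    [0, 1, 1, 1, 1, 0],
]
pred_map = [
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 1, 1, 0],
]

scores = per_class_dice(pred_map, truth_map, CLASS_NAMES)
mean_dice = sum(scores.values()) / len(scores)

for name, score in scores.items():
    print(f"{name:>6}: Dice = {score:.3f}")  # liver 0.929, lesion 0.500
print(f"  mean: Dice = {mean_dice:.3f}")     # 0.714
```

The mean of 0.714 gives no hint that the large class is near 0.93 while the small one is at 0.5.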

A Note on the Reference Standard

Segmentation metrics are computed against a ground truth mask. That mask was drawn by a human, or several humans. Two excellent radiologists segmenting the same lesion will not produce identical masks. Their Dice agreement with each other is often in the 0.85-0.92 range for tumor segmentation, depending on the structure and modality.

This sets a ceiling on what your model can meaningfully achieve against any single annotator’s mask. If two humans only agree at Dice = 0.88, a model that scores Dice = 0.85 against one specific human is essentially performing at human level. Interpreting that 0.85 as the model only segmenting 85% of the structure misses the point.

The right way to handle this is to put the model’s Dice next to a human-to-human Dice:

Use consensus segmentations or adjudicated ground truth for the test set (see 5.7 Reference Standards).
Report inter-annotator agreement on the same test set. This is your ceiling.
Frame model performance as “achieved Dice X, where inter-annotator Dice is Y.”
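One way to produce the two numbers in that framing is sketched below: the mean pairwise Dice between annotators, and the mean Dice of the model against each annotator. The three annotator masks and the model mask are made up, and averaging over annotators is only one of several reasonable choices (comparing against a consensus mask is another).

```python
from itertools import combinations
from statistics import fmean


def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


def rectangle(r0, r1, c0, c1):
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Hypothetical masks of the same lesion from three annotators and one model.
annotators = {
    "reader_1": rectangle(0, 10, 0, 10),
    "reader_2": rectangle(0, 10, 1, 11),
    "reader_3": rectangle(1, 11, 0, 10),
}
model_mask = rectangle(0, 10, 0, 9)

# Mean Dice over every pair of annotators: the human-to-human reference level.
inter_annotator = fmean(
    dice(annotators[a], annotators[b]) for a, b in combinations(annotators, 2)
)
# Mean Dice of the model against each annotator in turn.
model_vs_humans = fmean(dice(model_mask, mask) for mask in annotators.values())

print(f"achieved Dice {model_vs_humans:.2f}, "
      f"where inter-annotator Dice is {inter_annotator:.2f}")
# achieved Dice 0.88, where inter-annotator Dice is 0.87
```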

An implication of this human-agreement ceiling is that a paper that reports a Dice of 0.91 against a single annotator’s labels, with no inter-annotator analysis, is leaving out an essential piece of context.

A Worked Example, Because Pictures Help More Than Words Here

Imagine three models segmenting a 100-pixel tumor in a 1000-pixel image. Ground truth is exactly 100 pixels, in a roughly circular shape.

Model A: Predicts a 100-pixel mask, overlapping the truth by 90 pixels. 10 pixels are inside the truth but missed; 10 pixels are predicted outside the truth. Dice = 0.9. Volume error = 0%. Hausdorff = ~2 pixels.

Model B: Predicts a 100-pixel mask, but rotated 30° around the center of the tumor. 80 pixels overlap. Dice = 0.8. Volume error = 0%. Hausdorff = ~6 pixels.

Model C: Predicts a 200-pixel mask that fully contains the truth (so the truth is 100% covered). Dice = 0.67. Volume error = +100%. Hausdorff = ~5 pixels.

These three models have three different “stories”:

Model A is clinically excellent: right size, right location, small boundary wobble.
Model B has the right size but is poorly localized. Whether this matters depends on the application.
Model C is the worst on Dice but is “safe” in a radiotherapy contouring sense (it covers all the truth). On volume estimation, it’s terrible (a 2× overestimate).

A paper reporting only Dice would rank these as A > B > C. A paper reporting Dice and volume error and Hausdorff would tell a much richer story, and the right choice between B and C would depend on whether you care more about localization (B is worse) or volume estimation (C is much worse).
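The Dice and volume figures quoted for the three models follow directly from the pixel counts given in the example, as this short check shows. Hausdorff is not recomputed because it depends on the exact shapes, which the example does not specify.

```python
TRUTH_SIZE = 100  # pixels in the ground truth mask

# (overlap with truth, predicted mask size), taken from the example's description.
models = {
    "A": (90, 100),
    "B": (80, 100),
    "C": (100, 200),
}


def dice_from_counts(overlap, pred_size, truth_size):
    return 2 * overlap / (pred_size + truth_size)


def relative_volume_error(pred_size, truth_size):
    return (pred_size - truth_size) / truth_size


results = {
    name: (dice_from_counts(overlap, size, TRUTH_SIZE),
           relative_volume_error(size, TRUTH_SIZE))
    for name, (overlap, size) in models.items()
}

for name, (d, rve) in results.items():
    print(f"Model {name}: Dice = {d:.2f}, volume error = {rve:+.0%}")
# Model A: Dice = 0.90, volume error = +0%
# Model B: Dice = 0.80, volume error = +0%
# Model C: Dice = 0.67, volume error = +100%
```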

What Regulators Look For

For an AI/ML medical device that produces segmentations, the FDA expects:

The primary segmentation metric (usually Dice or IoU), with confidence intervals.
Boundary distance metrics (Hausdorff or HD95) when boundary accuracy matters clinically.
Volume error metrics when volume is the clinically actionable output.
Per-class and per-structure breakdowns.
Subgroup breakdowns when relevant (Principle 8 of GMLP).
Inter-annotator agreement on the test set, as a ceiling.
A clear explanation of how the reference standard was constructed (see 5.7).
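For the confidence intervals in the first item, one common option is a percentile bootstrap over test cases. The sketch below is an illustration under stated assumptions, not a regulatory recipe: the per-case Dice values are invented, resampling is done per case (per patient would be appropriate if patients contribute several cases), and the resample count, seed, and 95% level are arbitrary choices.

```python
import random
from statistics import fmean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample the cases with replacement and record the mean each time.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical per-case Dice scores from a small test set.
per_case_dice = [0.91, 0.88, 0.79, 0.93, 0.62, 0.85, 0.90, 0.87, 0.74, 0.92]

mean_dice = fmean(per_case_dice)
ci_low, ci_high = bootstrap_ci(per_case_dice)

print(f"mean Dice = {mean_dice:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

With only ten cases the interval is wide, which is itself useful information for a reader judging the headline number.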

Key Takeaways

Dice and IoU are the standard segmentation metrics. They measure overlap and are functionally interchangeable.
Dice is harshest on small structures and most forgiving of large ones. Comparing Dice across structures of different sizes is a trap.
Hausdorff distance captures the worst boundary error. Use it for any application where boundary accuracy in a specific location matters.
Volume error matters when volume is what’s clinically actionable. Don’t make a reader infer volume accuracy from Dice.
Inter-annotator agreement sets the ceiling. A Dice score above it against a single annotator should not be read as better segmentation.
Per-class metrics are essential for multi-class problems. Aggregates hide failures on the hard classes.