Post-Market Surveillance and Real-World Monitoring

Your obligations after clearance: adverse event reporting, complaint handling, performance monitoring, and the growing expectation for real-world performance data.

After you get FDA clearance, feel free to celebrate. For a little while, anyhow. Unfortunately your life will never be an FDA-free zone: clearance is phase one in an infinite series of phases. After clearance, regulators shift into oversight mode. They will watch what happens when your AI model meets thousands of real patients, messy data, and clinical workflows you never anticipated in your validation study.

This should make sense to a scientist because we know AI model performance changes when it hits the real world: maybe Samsung ships a firmware update for their MRI machines that subtly changes the output images, or maybe your product gets deployed in Alabama and your test set underrepresented the demographics of that region. In any case, there’s a legal requirement written into 21 CFR Part 803 (medical device reporting), 21 CFR Part 806 (corrections and removals), and 21 CFR Part 820 (quality system regulation) that you track, investigate, and report problems with your device in the field. For AI devices, it’s also the only way you’ll know whether the model you validated on a clean dataset continues to perform when deployed.

For most devices this is tracked by monitoring complaints and having a way for users to report defects. But for AI models the challenge is thornier, since you cannot unbox a neural network and inspect it. You cannot look at the “defect” the way you can see a crack in a metal stent, and performance degradation may be invisible to individual clinicians until it causes a wrong diagnosis. By then, you need to have already detected the shift through statistical monitoring. That requires infrastructure, discipline, and a mindset shift: from thinking of deployment as “we’re done” to thinking of it as the start of continuous monitoring.

What Post-Market Surveillance Must Include for AI Devices

The FDA expects post-market surveillance to be proportional to risk, but for any AI-based device with non-trivial clinical impact, that typically includes:

Performance Monitoring: You must measure your model’s accuracy, sensitivity, specificity, or other relevant metrics against ground truth in the real world. Ideally you would do this continuously, or at least at regular intervals. You need to know whether the model’s performance on real cases matches what you proved in your 510(k) submission.

For a diagnostic AI, that means collecting predictions alongside radiologist confirmations, pathology assessments, or clinical outcomes. For a treatment recommendation system, it means tracking whether clinicians followed your recommendations and what the clinical results were.

This ends up being a deployment and logistics issue. If you’ve set up a deployment system where your computers are handling all the inference remotely, you’re golden. But if you’re deploying an on-prem solution, you need to build in a way to get some sampling of the predictions out of those deployments. Then on the logistics side you need a way to contact physicians to sample what they thought of your product’s recommendations, and to track any changes in that metric over time.

Distribution Shift Detection: Your validation dataset was a snapshot of the messy, messy chaos of the real world. Once you deploy, patient populations change, imaging equipment gets upgraded, and clinical workflows shift. Your model’s performance may degrade without warning because the input data distribution has drifted from what the model was trained on. Your goal is to detect that drift before it erodes clinical performance.

Subgroup Performance Monitoring: Building on SGBA Plus (and increasingly, FDA expectations), you should track whether the model performs equally across demographic subgroups like age, sex, ethnicity, socioeconomic status, disease stage, and any other variable that might reveal disparate performance.

Feedback Collection from Clinicians: You should set up structured channels for users to report not just adverse events, but also edge cases, confusing outputs, or moments where the system’s recommendation surprised them. Adverse event reporting (Part 803) is mandatory for deaths, serious injuries, or malfunctions. But you should also consider setting up channels for physicians to report when your model’s predictions didn’t match their clinical intuition, when your report was just dumb, etc. In the best of all worlds this can be built right into your product so doctors don’t have to take an extra step to go to your website: a big “report problem” button might be scary to the marketers but is probably good for your product and is certainly good for your regulatory compliance.

Complaint Handling and Investigation: Every complaint needs investigation and documentation. Was it a data entry error? A truly misclassified image? A model failure? A user misunderstanding of the output? You need to trace complaints back to the model version, training data, and deployment configuration that produced them.

Periodic Summary Reporting: FDA expects you to file Medical Device Reports (MDRs) for serious events. You may also need to file periodic summary reports (PSRs) at whatever intervals your clearance conditions or a post-market surveillance order specify (often annually). These summarize your overall post-market data, including volume distributed, adverse events, complaints, corrective actions, and any performance trends you’ve detected.

The Unique Challenge for AI: Invisible Degradation

A physical device fails in ways you can point at. A fracture in the casing, or a flaking coating, or a broken spring. You can inspect the device itself (and throw it out the window if necessary, as engineering therapy).

An AI model’s failure is typically silent. The model makes a prediction, a clinician sees the prediction, the clinician accepts it, or maybe ignores it, or maybe rolls their eyes and rants to their colleague. If the prediction was wrong, but the clinician caught it, no adverse event occurs but you also never hear about it.

This creates a surveillance trap. By the time you learn about model failure through a traditional “adverse event”, the problem may have already affected many patients. Clinicians are generally competent, but they’re also busy. They may not catch every mistake, especially if the model’s errors are subtle or confidence scores are misleading.

This is why you need active monitoring, not passive complaint collection. You cannot wait for problems to bubble up through adverse event reports: you need to actively measure the model’s performance in deployed settings, compare it to expectations, and flag degradation statistically before it becomes clinically significant. This is a lot of effort (and money).

Building the Infrastructure: Input/Output Monitoring and Statistical Process Control

So what does “active monitoring” look like in practice?

Log every prediction: Your system should capture the model’s input, output (prediction), confidence or probability score, any ground truth or clinician feedback collected later, and metadata (timestamp, patient demographics, institution, model version). This is a large volume of data, but if you do it you pretty much have everything you need to track performance.
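As a rough illustration of what one logged prediction might contain, here is a minimal sketch of a log record. Every name in it (PredictionRecord, the field names, the sample values) is an assumption made for illustration; your own schema will depend on your product and your data agreements.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PredictionRecord:
    """One logged inference. All field names are illustrative assumptions."""
    case_id: str                  # pseudonymous case key, not a patient identifier
    model_version: str
    site_id: str
    device_model: str             # e.g. scanner make/model, when available
    input_summary: dict           # summary statistics or a pointer to the stored input
    prediction: str
    confidence: float
    age_band: Optional[str] = None
    sex: Optional[str] = None
    ground_truth: Optional[str] = None        # filled in later, once confirmed
    clinician_feedback: Optional[str] = None  # filled in later, if offered
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        """Serialize as one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example values.
record = PredictionRecord(
    case_id="case-0001",
    model_version="1.4.2",
    site_id="site-A",
    device_model="vendor-a-scanner",
    input_summary={"mean_intensity": 0.41, "pixel_spacing_mm": 0.7},
    prediction="nodule_present",
    confidence=0.87,
    age_band="60-69",
    sex="F",
)
line = record.to_json_line()
```

Keeping ground truth and clinician feedback as optional fields reflects the fact that they usually arrive days or weeks after the prediction itself.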

Compare predictions to ground truth systematically: For diagnostic models, this might mean daily or weekly comparison of predictions against verified diagnoses. For models that recommend next steps, it might mean tracking clinical outcomes against what the model predicted. The cadence depends on volume and risk, but you need a schedule.
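One way to make that comparison concrete is to recompute sensitivity and specificity from each batch of confirmed cases and put a confidence interval around them, so a small batch does not look like a real change. The sketch below uses the Wilson score interval; the function names and the sample counts are assumptions for illustration.

```python
import math
from typing import Iterable, Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("no cases to evaluate")
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


def sensitivity_specificity(pairs: Iterable[Tuple[bool, bool]]) -> dict:
    """pairs holds (predicted_positive, truly_positive) for confirmed cases.

    Returns (numerator, denominator) counts for each metric.
    """
    tp = fn = tn = fp = 0
    for predicted, truth in pairs:
        if truth and predicted:
            tp += 1
        elif truth:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return {"sensitivity": (tp, tp + fn), "specificity": (tn, tn + fp)}


# Hypothetical batch: 100 confirmed positives, 200 confirmed negatives.
pairs = (
    [(True, True)] * 90 + [(False, True)] * 10
    + [(False, False)] * 180 + [(True, False)] * 20
)
counts = sensitivity_specificity(pairs)
sens_low, sens_high = wilson_interval(*counts["sensitivity"])
```

A reasonable reading is to investigate when the whole interval sits below the figure in your submission, rather than reacting to the point estimate alone.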

Apply statistical process control: Don’t just look at raw accuracy numbers. Use control charts (e.g., cumulative sum charts, moving average charts) to detect statistically significant shifts in accuracy, sensitivity, or specificity.
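To give a feel for how a cumulative sum chart works, here is a minimal one-sided CUSUM that watches for a downward shift in a periodic metric such as weekly sensitivity. The target, slack, and threshold values are placeholders chosen for illustration; in practice you would tune them on historical data.

```python
from typing import Optional, Sequence


def cusum_downward(
    values: Sequence[float], target: float, slack: float, threshold: float
) -> Optional[int]:
    """Return the index of the first alarm for a downward shift, or None.

    Each period adds (target - value - slack) to a running sum that is
    floored at zero, so small dips are forgiven and sustained shortfalls
    accumulate until they cross the threshold.
    """
    running = 0.0
    for index, value in enumerate(values):
        running = max(0.0, running + (target - value) - slack)
        if running > threshold:
            return index
    return None


# Hypothetical weekly sensitivity values.
stable = [0.92, 0.93, 0.91, 0.92, 0.93, 0.91]
drifting = [0.92, 0.93, 0.91, 0.92, 0.90, 0.89, 0.88, 0.89, 0.88]

no_alarm = cusum_downward(stable, target=0.92, slack=0.01, threshold=0.05)
first_alarm = cusum_downward(drifting, target=0.92, slack=0.01, threshold=0.05)
```

In the drifting series no single week looks alarming on its own; the alarm comes from several consecutive shortfalls adding up, which is the property that makes CUSUM useful for slow degradation.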

Monitor input distributions: Collect summary statistics on the inputs your model receives (e.g., image characteristics for imaging AI, lab value ranges for diagnostic reasoning tools). If the distribution of inputs shifts, you may see performance shift too. Detecting input drift tells you where to investigate.
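One common way to summarize input drift for a single numeric feature is the population stability index (PSI), which compares the binned distribution of recent inputs against a reference sample. This is a sketch under stated assumptions: the feature, the bin edges, and the data are invented, and the often-quoted cutoffs (about 0.1 for "watch" and 0.25 for "investigate") are rules of thumb, not regulatory thresholds.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Fraction of values in each bin; empty bins are floored at eps to avoid log(0)."""
    if not values:
        raise ValueError("need at least one value")
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_right(edges, value)] += 1
    return [max(count / len(values), eps) for count in counts]


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], edges: Sequence[float]
) -> float:
    """PSI = sum over bins of (current - reference) * ln(current / reference)."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical feature, e.g. mean image intensity per scan.
reference = [i / 100 for i in range(100)]
shifted = [value + 0.3 for value in reference]
edges = [0.25, 0.5, 0.75]

psi_same = population_stability_index(reference, reference, edges)
psi_shifted = population_stability_index(reference, shifted, edges)
```

PSI only tells you that a feature has moved, not whether the move matters clinically, so treat it as a pointer for investigation.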

Segment by subgroups: Don’t aggregate all predictions into one performance metric. Break it down by age, sex, ethnicity, institution, imaging device, or other relevant dimensions. If the model’s accuracy is 92% overall but 78% in a specific demographic, you have an equity problem that the aggregate metric hides.
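The breakdown itself can be very simple. The sketch below groups confirmed cases by one dimension and flags any subgroup whose accuracy falls well below the overall figure. The record layout, the `site` key, and the gap and minimum-case settings are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_subgroup(records: Iterable[dict], key: str) -> Tuple[float, Dict[str, Tuple[float, int]]]:
    """Return overall accuracy and {subgroup: (accuracy, case_count)}.

    Records without confirmed ground truth are skipped.
    """
    tallies = defaultdict(lambda: [0, 0])  # subgroup -> [correct, total]
    for record in records:
        if record.get("ground_truth") is None:
            continue
        tally = tallies[record.get(key, "unknown")]
        tally[0] += int(record["prediction"] == record["ground_truth"])
        tally[1] += 1
    total = sum(t[1] for t in tallies.values())
    if total == 0:
        raise ValueError("no confirmed cases")
    overall = sum(t[0] for t in tallies.values()) / total
    return overall, {group: (c / n, n) for group, (c, n) in tallies.items()}


def flag_gaps(overall: float, by_group: dict, max_gap: float = 0.05, min_cases: int = 30) -> List[str]:
    """Subgroups with enough cases whose accuracy trails overall by more than max_gap."""
    return sorted(
        group for group, (acc, n) in by_group.items()
        if n >= min_cases and overall - acc > max_gap
    )


# Hypothetical confirmed cases from two sites.
records = (
    [{"site": "site-A", "prediction": 1, "ground_truth": 1}] * 95
    + [{"site": "site-A", "prediction": 0, "ground_truth": 1}] * 5
    + [{"site": "site-B", "prediction": 1, "ground_truth": 1}] * 31
    + [{"site": "site-B", "prediction": 0, "ground_truth": 1}] * 9
)
overall, by_site = accuracy_by_subgroup(records, key="site")
flagged = flag_gaps(overall, by_site)
```

The minimum-case setting is there because small subgroups produce noisy estimates; pairing each subgroup figure with a confidence interval would be a natural next step.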

These recommendations are an ideal case, and ignore the very real business challenges around data sharing, privacy, anonymization, and so on. Have these considerations in mind when you draft your usage conditions and client contracts.

Data Drift vs. Concept Drift: What’s Actually Changing?

These two terms get thrown around a lot, so let’s define them:

Data drift (or covariate shift) means the distribution of inputs has changed, but the underlying relationship between inputs and outputs is the same. For example, your model was trained on imaging scans from Random Vendor A’s equipment, but you’re now deploying in hospitals with Random Vendor B’s equipment, which has different noise and resolution characteristics. The inputs have shifted, but a diagnosis (say, “nodule present”) still means the same thing.

Concept drift means the underlying definition or relationship has changed. For example, your model was trained on one population’s definition of what constitutes a “positive” screening result, but in a new population with different disease prevalence or risk factors, the same imaging finding implies different clinical action.

Your monitoring system needs to watch for both. You are tracking inputs, and if you see the data distributions shifting, that’s data drift. On the other hand if the inputs seem stable but the model accuracy is degrading, that might be concept drift. (Or it might point to other issues like ground truth labeling changing, clinical workflow changing, etc.)
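The reasoning in the paragraph above amounts to a small decision table. Here it is as a sketch; the function name and the wording of each result are hypothetical, and the two boolean inputs are assumed to come from your own input-drift and performance monitors.

```python
def triage_drift(input_drift_detected: bool, performance_drop_detected: bool) -> str:
    """Map two monitoring signals to a first hypothesis worth investigating."""
    if input_drift_detected and performance_drop_detected:
        return "data drift affecting performance: start with the shifted inputs"
    if input_drift_detected:
        return "data drift without a measured drop: watch closely, check subgroups"
    if performance_drop_detected:
        return "possible concept drift: also check labeling and workflow changes"
    return "no drift signal: continue routine monitoring"


first_hypothesis = triage_drift(input_drift_detected=False, performance_drop_detected=True)
```

The output is a starting hypothesis, not a diagnosis: a drop with stable inputs can just as easily come from a change in how ground truth is being recorded.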

The PCCP Connection: Monitoring Feeds Improvement

If you submitted a Predetermined Change Control Plan (PCCP) as part of your 510(k), all this post-market data is the information you need to implement that plan. The PCCP allows you to deploy certain modifications to the model (retraining on new data, algorithm parameter tuning, etc.) without a new 510(k), provided those changes fall within your pre-defined “allowable modifications.”

But the PCCP is not carte blanche to mess with your model the way you did during development. It requires evidence that the change is safe and maintains performance, and the post-market surveillance data you’re collecting provides that evidence. You show FDA, “Here’s the deployed model’s performance in real world conditions. Here’s the retrained model’s performance on the same real-world data. Performance is maintained or improved. The change is within my PCCP boundaries.”
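As a sketch of what "performance on the same real-world data" can look like in code, the function below scores the deployed and retrained models on identical cases, counts the cases where only one of them is right, and runs an exact McNemar test on those discordant pairs. The function name, the 0.01 margin, and the sample data are assumptions; the actual acceptance criteria are whatever your authorized PCCP specifies.

```python
from math import comb
from typing import Sequence


def compare_on_same_cases(
    truth: Sequence[int], deployed: Sequence[int], retrained: Sequence[int], margin: float = 0.01
) -> dict:
    """Paired comparison of two models scored on identical confirmed cases."""
    n = len(truth)
    if n == 0 or not (n == len(deployed) == len(retrained)):
        raise ValueError("need equal-length, non-empty inputs")
    deployed_ok = [d == t for d, t in zip(deployed, truth)]
    retrained_ok = [r == t for r, t in zip(retrained, truth)]
    only_deployed = sum(d and not r for d, r in zip(deployed_ok, retrained_ok))
    only_retrained = sum(r and not d for d, r in zip(deployed_ok, retrained_ok))
    discordant = only_deployed + only_retrained
    if discordant == 0:
        p_value = 1.0
    else:
        # Exact two-sided McNemar test: binomial tail on the discordant pairs.
        tail = sum(comb(discordant, i) for i in range(min(only_deployed, only_retrained) + 1))
        p_value = min(1.0, 2 * tail / 2 ** discordant)
    acc_deployed = sum(deployed_ok) / n
    acc_retrained = sum(retrained_ok) / n
    return {
        "acc_deployed": acc_deployed,
        "acc_retrained": acc_retrained,
        "only_deployed_correct": only_deployed,
        "only_retrained_correct": only_retrained,
        "mcnemar_p": p_value,
        "within_margin": acc_retrained >= acc_deployed - margin,
    }


# Hypothetical case set: 10 positives followed by 10 negatives.
truth = [1] * 10 + [0] * 10
deployed = list(truth)
deployed[0], deployed[1], deployed[10] = 0, 0, 1   # deployed model misses three cases
retrained = list(truth)
retrained[0] = 0                                   # retrained model misses one case
result = compare_on_same_cases(truth, deployed, retrained)
```

With a sample this small the test cannot distinguish the two models, which is itself useful to know before you claim an improvement.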

The Feedback Loop: Closing the Circle

The healthiest organizations close the loop:

Deployment: Model goes live. Infrastructure logs predictions, inputs, ground truth, clinician feedback.
Monitoring: Monthly or quarterly analysis. Accuracy stable? Any subgroup disparities? Any input drift?
Detection: Quarterly summary shows a 3% drop in sensitivity for a specific demographic, correlated with a new imaging device at a major customer site.
Investigation: Dig into the data. What changed? Is the model failing, or has the clinical population changed? Is the ground truth label consistent?
Improvement: Retrain the model on recent data including the new device. Validate on a held-out test set. Document the change under your PCCP, or prepare a new submission if the change falls outside it.
Revalidation: Deploy the improved model. Monitor new performance metrics. Is the drift corrected?
Back to Deployment: The improved model goes live. The loop continues.

Practical Advice: Build First, Deploy Second

By now, if you’re not overwhelmed by all the regulatory requirements you’re facing, you need to go back and read this section from the beginning. But despite the volume of requirements, the overall message is simple: think about these issues during development. For post-market surveillance, that means building reporting and monitoring features into the end product.

Ask yourself now:

Where will I capture inputs and outputs in my deployment environment?
What ground truth data can I reliably collect, and on what timeline?
How will I handle privacy concerns (HIPAA, GDPR) in storing real-world predictions alongside patient identifiers?
Who owns the responsibility for monthly analysis? What skill sets are needed?
What are my red lines? At what performance threshold do I escalate to leadership or regulators?
How do I make this sustainable? A homegrown spreadsheet won’t scale to thousands of predictions per week.
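The "red lines" question in particular is easier to answer once the thresholds are written down where the monitoring code can read them. A minimal sketch follows; the RedLine name, the metrics, and every number are invented for illustration and would need to come from your own risk analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedLine:
    """Two-level threshold for one monitored metric (values are placeholders)."""
    metric: str
    investigate_below: float   # open an internal investigation
    escalate_below: float      # escalate to leadership / regulatory affairs


def classify(value: float, line: RedLine) -> str:
    """Return 'escalate', 'investigate', or 'ok' for an observed metric value."""
    if value < line.escalate_below:
        return "escalate"
    if value < line.investigate_below:
        return "investigate"
    return "ok"


RED_LINES = [
    RedLine(metric="sensitivity", investigate_below=0.90, escalate_below=0.85),
    RedLine(metric="specificity", investigate_below=0.88, escalate_below=0.80),
]

# Hypothetical monthly observations.
observed = {"sensitivity": 0.87, "specificity": 0.91}
status = {line.metric: classify(observed[line.metric], line) for line in RED_LINES}
```

Agreeing on these numbers before launch means the escalation decision is already made when a metric dips, instead of being argued under pressure.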

Build the infrastructure, test it on historical data, make it part of your product before you ship.

Key Takeaways

Post-market surveillance is a legal requirement (21 CFR 803, 806, 820) and is essential for AI devices, where performance degradation can be invisible to individual clinicians.
Active monitoring beats passive complaint collection: Log predictions, inputs, ground truth, and clinician feedback. Don’t wait for adverse events to surface problems.
Use statistical process control to detect meaningful performance shifts, separate signal from noise, and flag degradation before it’s clinically significant.
Monitor subgroups separately: Overall accuracy can hide disparate performance in specific demographics. SGBA Plus is not just a regulatory box; it’s essential surveillance.
Data drift and concept drift require different investigations: Input distribution shifts may indicate a retraining opportunity. Underlying relationship changes require deeper thought.
PCCP leverages post-market data: Your monitoring evidence is what justifies future modifications. Build the surveillance system to feed your improvement cycle.
Design post-market infrastructure before deployment: Set up logging, ground truth collection, and analysis workflows during development. Test on historical data. Make it part of your product.