Good Machine Learning Practice (GMLP): The 10 Guiding Principles

The FDA/Health Canada/MHRA consensus principles for AI/ML medical device development. What each principle means in practice and how to build compliance into your development process.

In 2021, the FDA, Health Canada, and the UK’s MHRA jointly released a set of guiding principles for Good Machine Learning Practice. It’s not The Law and there’s no jail time for violating it. But it’s the closest thing we have to a shared consensus about what “responsible ML in medical devices” looks like, and regulators increasingly expect it.

If you read through the 10 principles, most of them feel less like regulation and more like good science. If you’re already running clinical trials right, if you’re already thinking about data quality and generalization, if you’re already documenting your decisions, you’re most of the way there.

1. Multi-Disciplinary Expertise: You Need More Than Data Scientists

Your ML model is being built by your data science team. That’s necessary but wildly insufficient.

What this means in practice: You need clinical expertise (someone who knows the disease, the current diagnostic or treatment approach, the clinician’s workflow), statistical expertise (someone who understands power, bias, confounding, validation), software engineering expertise (someone building production-ready code, not Jupyter notebooks), and regulatory or quality expertise (someone who knows compliance, documentation, traceability).

If you’re a startup of three data scientists, or on the flip side an endocrinologist with a great AI idea, this seems impossible. But “expertise” doesn’t always mean a full-time employee. It means you have access to someone who can challenge your choices. It means a cardiologist looks at your training data and says, “Wait, this doesn’t reflect how we actually manage STEMI in our hospital.” It means a biostatistician reviews your cross-validation strategy and spots the data leakage issue you missed.

For many academic teams, this is your normal department structure. For startups, it’s your advisory board or your regulatory consultant.

Why it matters: ML models have invisible failure modes. A data scientist might think a feature is predictive when it’s actually a proxy for data quality issues specific to one hospital system, which is something that a clinician would catch. A statistician might see that your model performance in subgroups differs dramatically and flag generalization risk. A software engineer might notice you’re training and validating on data from the same source, which violates independence assumptions.

Practical checklist:

Do you have a clinician reviewing your training data and intended use for plausibility?
Is a statistician reviewing your study design, validation approach, and power calculations?
Is your code reviewed for security, logging, and production readiness, not just bugs?
Is someone (you, a consultant, an advisory board) making sure you’re meeting documentation and traceability standards?

If the answer to any is “no,” you have a knowledge gap to fill.

2. Good Software Engineering and Security Practices: Your Model Is Software

Good software engineering practices rarely get much attention in research software, but the FDA expects them throughout the model development process. You’ve built a beautiful ML pipeline in Python. It’s modular, it’s reproducible, it’s validated. Then you try to deploy it, and suddenly you’re thinking about Docker containers, dependency management, access logs, encryption, code signing, and a hundred other things that make you miss Jupyter.

What this means in practice: Your ML code needs to follow software engineering standards as rigorously as any other medical device software. This includes version control (every change tracked and traceable), code review (another engineer looks at your code before it goes to production), automated testing (you run tests every time code changes), logging and monitoring (you know what the model is doing in the field), and security hardening (access controls, data encryption, secure communication).

For the regulatory pathway you’re pursuing, the FDA will ask for evidence of this. They’ll ask to see your software development plan, your change control process, your testing strategy, your deployment architecture. If your answer is “we have version control in GitHub and we manually test before deploying,” you’re going to get deficiency letters.

Why it matters: A model with a data quality drift bug is a model that silently produces wrong answers in production. A model without access logs can’t prove what decision it made for which patient. A model without encryption during data transmission can leak training data. Security and engineering discipline aren’t nice-to-haves when people’s health is on the line.
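To make the access-log point concrete, here is a minimal sketch of what one audit record per prediction could contain. The function name, field names, and salt handling are illustrative assumptions, not a prescribed format, and a salted hash is pseudonymization rather than de-identification, so your privacy and quality people still need to sign off on what gets stored.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(patient_id, features, model_version, prediction, salt):
    """Return one JSON-serializable audit entry for a single prediction.

    All field names here are hypothetical; adapt them to your own logging schema.
    """
    # Hash the identifier so the log can be linked to a patient without storing the raw ID.
    patient_key = hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()
    # Fingerprint the inputs so the exact feature values can be verified later.
    canonical_inputs = json.dumps(features, sort_keys=True).encode("utf-8")
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "patient_key": patient_key,
        "model_version": model_version,
        "input_sha256": hashlib.sha256(canonical_inputs).hexdigest(),
        "prediction": prediction,
    }


record = build_audit_record(
    patient_id="MRN-0001",
    features={"age": 63, "hba1c": 8.1},
    model_version="1.4.2",
    prediction={"label": "refer", "score": 0.87},
    salt="replace-with-a-managed-secret",  # assumption: a secret kept outside the codebase
)
print(json.dumps(record, indent=2))
```

Recording the model version alongside an input fingerprint is what lets you answer, months later, which model produced which output for which case.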

Practical checklist:

Is all code in version control with meaningful commit messages and traceability?
Is there a code review process before code merges to your main branch?
Do you have automated tests that run on every code change?
Do you have logging in production that captures inputs, model versions, predictions, and key decision points?
Is your codebase documented so someone else could maintain it?
Is sensitive data encrypted in transit and at rest?
Do you have a documented process for updating models and deploying changes?

If you’re a startup running on cloud infrastructure, a lot of this can be handled by your cloud provider (managed logging, encryption, access control). But you need to know what you’re relying on.

3. Representative Clinical Study Participants and Datasets: Generalization Doesn’t Happen by Accident

Your model performs beautifully on your training data. Then you deploy it and performance drops 16%. This generalization problem is so common for AI models that it’s got a name: the Stanford problem. If your entire training set consists of Stanford medical students who have to participate in a research study, it may not work as well when deployed to a hospital in Minnesota.

What this means in practice: The patients in your training dataset must represent the patients in your intended use population. If you’re building an AI for retinopathy screening in sub-Saharan Africa, your training data needs people from sub-Saharan Africa (different retinal imaging equipment, different disease prevalence, different patient factors). If you’re building for the US, you need adequate representation of racial and ethnic groups that exist in the US, older patients, younger patients, patients with comorbidities.

This also means the clinical settings need to be representative. If you train on fundus images from a tertiary care ophthalmology center and then deploy in a primary care clinic with older imaging equipment, you’ll have problems.

Why it matters: ML models are excellent at fitting the specific data they see. If all your ultrasounds come from one Philips machine, the model learns features specific to that machine. If your training population is 80% male and 20% female (and women have different disease presentation), your model will perform worse on women.

This is especially acute in medical AI because disease and patient presentation vary by demographics, geography, and healthcare systems. Generalization is hard.

Practical checklist:

Does your training dataset include the demographic groups (age, sex, race/ethnicity) represented in your intended use population?
Do you have data on how model performance varies by subgroup? (You do, because you measured it in validation, right?)
Is your training data from imaging systems, labs, or settings similar to where the device will be deployed?
If deploying internationally or to new healthcare systems, do you have validation data from those settings?
Can you describe your intended use population explicitly? (Not “people with diabetic retinopathy” but “adults with type 2 diabetes, aged 40-80, with or without known retinopathy, in primary care or diabetic clinics in the US, using digital fundus photography.”)

The more specific you can be about intended use, and the more your training data matches that description, the better your generalization will be.
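One rough way to check the first checklist item is to compare subgroup shares in your training data against the shares you expect in the intended use population. The sketch below assumes records are dictionaries and that you can supply target shares from epidemiological data; the function name and the 10-point tolerance are placeholders, not a recognized standard.

```python
from collections import Counter


def representation_gaps(records, field, target_shares, tolerance=0.10):
    """Flag subgroups whose share of the dataset differs from the target share.

    Returns {group: (observed_share, target_share)} for every group whose
    absolute difference exceeds `tolerance`. The tolerance is an assumption.
    """
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    gaps = {}
    for group, target in target_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - target) > tolerance:
            gaps[group] = (round(observed, 3), target)
    return gaps


# The 80/20 sex imbalance described above, against a hypothetical 50/50 target.
training = [{"sex": "M"}] * 80 + [{"sex": "F"}] * 20
gaps = representation_gaps(training, "sex", {"M": 0.5, "F": 0.5})
print(gaps)
```

Matching shares is necessary, not sufficient: a subgroup can be proportionally represented and still have too few cases to estimate performance for it.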

4. Training Data Independence From Test Sets: Don’t Peek at the Answer Key

This is basic science, but it’s worth emphasizing because it’s so easy to mess up in ML.

What this means in practice: Your training set, validation set, and test set must be truly independent. Not just different patients, but data collected independently. Ideally, that means data from different time periods, different locations, or different data sources. If you have to split a single dataset, make sure each split is still representative of your intended use population (see principle 3 above).

Here’s a common mistake an AI newbie might make. You learn enough Python to train a segmentation model. As part of your code, you split the data into training and test sets, do a bunch of training, pay Amazon a bunch of money for the GPUs, and test the model on your test set. It’s not great, so you adjust the architecture and fiddle with some hyperparameters, then do it again: split into training/test sets, pay Amazon, and test the model.

You probably spotted it: because every run re-splits the data, scans that were in your first training set end up in a later test set, and you keep adjusting the model against test results. Given enough runs, every scan has been training data at some point, and your test score measures memorization rather than generalization.

Why it matters: If you’ve trained on your test set (even implicitly, by doing hyperparameter tuning on it), you’ve overfitted to that specific data. Your reported performance is optimistic and deployment performance will be worse. This also matters for regulatory review. The FDA expects to see three independent datasets: one for training, one for validation (used during development to tune the model), and one for testing (held out, used to generate your primary performance claims).
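One way to avoid the re-splitting trap is to make split assignment a fixed function of the patient identifier, so the same patient always lands in the same set no matter how many times you rerun the pipeline. This is a minimal sketch; the function name, the fractions, and the example file names are assumptions, and hashing gives only approximate split sizes.

```python
import hashlib


def assign_split(patient_id, test_fraction=0.2, validation_fraction=0.1):
    """Deterministically map a patient ID to 'train', 'validation', or 'test'."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + validation_fraction:
        return "validation"
    return "train"


# Hypothetical scans: two belong to the same patient and must stay together.
scans = [
    ("patient-001", "scan_a.dcm"),
    ("patient-001", "scan_b.dcm"),
    ("patient-002", "scan_c.dcm"),
]
splits = {}
for patient_id, scan in scans:
    splits.setdefault(assign_split(patient_id), []).append(scan)
print(splits)
```

Splitting by patient rather than by scan also prevents the quieter leak where two images from the same person sit on both sides of the split. It does not give you temporal or site independence; that still has to come from how the data was collected.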

Practical checklist:

Do you have three explicitly separate datasets (training, validation, test)?
Is the test set held completely separate until final evaluation? (No hyperparameter tuning on test data, no peeking.)
Is there temporal or geographic or site independence between training and test if possible?
Can you describe exactly which patients/records went into training vs. test?
Have you used cross-validation or other resampling to estimate performance, and reported both cross-validated performance and held-out test performance?

5. Reference Datasets Based on Best Available Methods: Your Gold Standard Needs to Be Defensible

Every time your model makes a prediction, there’s a ground truth. The tumor is or isn’t present. You need a reference standard (also called a gold standard) to compare your model against.

What this means in practice: Your reference standard should be the best available method for determining the truth, not the easiest. If you’re building an AI for detecting pneumonia on chest imaging, the reference should be interpretation by experienced radiologists, ideally with multiple readers and adjudication of disagreements. What constitutes a gold standard is very domain-specific, and there might not even be consensus within a discipline, so be prepared to justify your choice of gold standard via literature references.

Your reference standard should also be consistent across the entire dataset. If some patients have a reference standard based on clinical assessment and others have it based on imaging, you have a mixing problem.

Why it matters: Your model is only as good as your reference standard. If the reference standard is noisy or biased, your model learns noise and bias. If you train against a weak reference, you’ll underestimate what the model could achieve with a strong reference. If regulators see you trained against clinical documentation (which is often incomplete or inconsistent), they’ll question whether your performance estimates are real.

For regulatory submissions, the FDA will ask: how was your reference standard determined, how was it quality-assured, and why is it the best available method? “We used the diagnosis in the EHR” is not an answer that will satisfy them, at least without documentation and evidence of how that EHR record was created.

Practical checklist:

Is your reference standard the best available method for determining ground truth? (Not the easiest, the best.)
Is it applied consistently across your entire dataset?
If multiple readers are involved, do you have reader qualifications described and documented?
Have you assessed inter-reader reliability? (Kappa or ICC for your reference standard itself.)
Are there any systematic differences in how the reference standard was determined across subgroups? (Different readers for different sites, different methods for different time periods.)
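For the inter-reader reliability item, Cohen's kappa for two readers is simple enough to compute by hand. The sketch below uses made-up labels and covers only the two-reader, categorical case; more readers or ordinal scales call for other statistics (Fleiss' kappa, weighted kappa, ICC).

```python
from collections import Counter


def cohens_kappa(reader_a, reader_b):
    """Cohen's kappa for two readers who labeled the same cases."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("Both readers must label the same non-empty set of cases.")
    n = len(reader_a)
    # Observed agreement: fraction of cases where the two labels match.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Chance agreement: product of each reader's marginal label frequencies.
    counts_a, counts_b = Counter(reader_a), Counter(reader_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both readers used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only, not real reader data.
reader_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos", "neg"]
reader_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
kappa = cohens_kappa(reader_1, reader_2)
print(round(kappa, 3))
```

In this toy example the readers agree on 8 of 10 cases, but kappa is noticeably lower than 0.8 because some of that agreement would happen by chance.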

6. Model Design Tailored to Available Data and Intended Use: Fit the Model to Reality, Not Vice Versa

There’s a temptation to use the fanciest model architecture. Transformers are in vogue now. Vision transformers for images, BERT-like models for NLP. Your humble correspondent is a Reinforcement Learning fanboy, but that doesn’t mean RL is the best approach for every problem.

What this means in practice: Your model architecture should match the size and quality of your dataset and the constraints of your intended use. If you have 500 patient records, training a transformer from scratch is very unlikely to work. You probably need something simpler (logistic regression, random forest, small neural network) that won’t overfit to your small dataset. If you’re deploying in a resource-constrained setting, you need a model that runs efficiently, not one that requires GPU inference.

Your intended use also matters. If your device needs to provide a recommendation in under 100 milliseconds (for real-time clinical use), you can’t use a model that takes 30 seconds to run inference. If clinicians need to understand why the model made a recommendation, you need something interpretable, not a black-box deep neural network.

Why it matters: A fancy model overfitted to your training data will generalize worse than a simple model trained thoughtfully. A slow model won’t be used in time-critical settings. An opaque model will fail regulatory review in contexts where interpretability is needed.

The art is choosing a model that’s complex enough to capture the signal in your data but not so complex that it memorizes noise.

Practical checklist:

Does your model architecture match your dataset size? (More parameters than samples is usually a red flag.)
Have you considered simpler models as baselines to compare against?
Does your model meet the computational requirements of your intended use? (Response time, hardware available, power consumption.)
If interpretability matters (it usually does in clinical settings), have you chosen a model that supports explanation?
Have you validated that your model generalizes? (Same performance on held-out test data as on training data, or close enough that you can explain the gap.)

7. Focus on Human-AI Team Performance: You’re Optimizing for Clinician + AI, Not AI Alone

This is a principle that often gets missed because it changes what you measure.

What this means in practice: Your clinical validation study shouldn’t measure just the AI’s performance. It should measure what happens when a clinician uses the AI versus when they don’t. Does the AI reduce error rate? Does it reduce time to diagnosis? Does it improve diagnostic agreement among clinicians? Does it increase appropriate referrals and decrease unnecessary workups?

If your AI is a diagnostic assistant, you measure accuracy when a clinician reviews the AI’s recommendation versus accuracy with standard practice. If your AI is a risk classifier, you measure whether clinician decision-making improves when given the AI’s risk score.
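As a toy illustration of the with-versus-without comparison, the sketch below takes paired correctness flags for the same cases read unaided and read with AI assistance, and applies an exact McNemar test to the discordant pairs. It assumes a single reader and made-up numbers; a real reader study with multiple readers needs a design and analysis that account for reader and case variability, which this does not.

```python
from math import comb


def paired_reader_comparison(correct_without_ai, correct_with_ai):
    """Compare paired correctness on the same cases, read without and with AI."""
    pairs = list(zip(correct_without_ai, correct_with_ai))
    if not pairs:
        raise ValueError("At least one paired case is required.")
    n = len(pairs)
    improved = sum(1 for before, after in pairs if not before and after)
    worsened = sum(1 for before, after in pairs if before and not after)
    discordant = improved + worsened
    # Exact two-sided McNemar test: under the null hypothesis, discordant
    # pairs are equally likely to be improvements or deteriorations.
    if discordant == 0:
        p_value = 1.0
    else:
        smaller = min(improved, worsened)
        tail = sum(comb(discordant, i) for i in range(smaller + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)
    return {
        "accuracy_without_ai": sum(before for before, _ in pairs) / n,
        "accuracy_with_ai": sum(after for _, after in pairs) / n,
        "improved": improved,
        "worsened": worsened,
        "p_value": p_value,
    }


# Hypothetical study: 100 cases, 70 correct unaided; AI flips 12 to correct, 2 to wrong.
without_ai = [True] * 70 + [False] * 30
with_ai = [True] * 68 + [False] * 2 + [True] * 12 + [False] * 18
result = paired_reader_comparison(without_ai, with_ai)
print(result)
```

Note that the "worsened" count matters as much as the headline accuracy: those are the cases where the AI talked a clinician out of a correct call.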

Why it matters: A model that’s 95% accurate in a vacuum might not improve clinical care if clinicians ignore it, over-rely on it, or get confused by it. Conversely, a model that’s 87% accurate might dramatically improve clinical care if clinicians know when to trust it and when to override it.

This is also what regulators actually care about. The FDA doesn’t clear or approve algorithms in a vacuum. It authorizes devices for use in specific clinical contexts. So your evidence should reflect that.

Practical checklist:

Does your clinical validation study include clinician users, not just a retrospective comparison?
Are you measuring clinician performance with the AI (e.g., diagnostic accuracy when viewing the AI recommendation alongside clinical data) versus without?
Have you assessed whether clinicians understand the AI’s output and when to trust it?
Are there scenarios where clinicians should override the AI, and have you identified them?
Are you measuring something that matters clinically (time to diagnosis, diagnostic accuracy, appropriate referrals, patient outcomes), not just algorithmic accuracy?

8. Clinically Meaningful Performance Testing: Measure What Matters

Not all performance metrics are created equal.

What this means in practice: You report sensitivity, specificity, area under the ROC curve. Those are fine for academic papers. For clinical validation, you also need to report metrics that clinicians understand and care about. For a diagnostic system: positive predictive value (what fraction of positive results are true positives in your test population), negative predictive value (what fraction of negative results are true negatives), likelihood ratios (how much a test result changes the probability of disease). For a screening system: what’s the referral rate, what’s the disease detection rate, what’s the false positive rate that triggers unnecessary workups.

Report performance not just overall but by clinically relevant subgroups. If your AI is for diabetic retinopathy screening and performance is 95% overall but 85% in Black patients, that’s a problem you need to address, not hide.

Use confidence intervals, not just point estimates. “Sensitivity 94%” is less informative than “Sensitivity 94% (95% CI 91-97%).”
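Here is a small sketch of per-subgroup sensitivity with a Wilson score interval, one common choice for binomial proportions. The subgroup labels and counts are invented for illustration, and the input format (tuples of subgroup, truth, prediction) is an assumption.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("Cannot compute an interval with zero observations.")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def sensitivity_by_subgroup(cases):
    """cases: iterable of (subgroup, truth, prediction) with boolean truth/prediction."""
    tallies = {}
    for subgroup, truth, prediction in cases:
        if truth:  # Sensitivity only looks at cases that truly have the disease.
            detected, total = tallies.get(subgroup, (0, 0))
            tallies[subgroup] = (detected + int(prediction), total + 1)
    return {
        group: (detected / total, wilson_interval(detected, total))
        for group, (detected, total) in tallies.items()
    }


# Invented counts: subgroup A has 200 diseased cases, subgroup B only 40.
cases = (
    [("A", True, True)] * 188 + [("A", True, False)] * 12
    + [("B", True, True)] * 34 + [("B", True, False)] * 6
)
results = sensitivity_by_subgroup(cases)
for group, (sens, (low, high)) in results.items():
    print(f"{group}: sensitivity {sens:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The smaller subgroup gets a visibly wider interval, which is the honest way to show that you know less about performance there.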

Why it matters: Sensitivity and specificity are important, but they don’t tell a clinician what to do. A radiologist wants to know: if this AI flags this lung nodule as suspicious, what’s the probability it’s actually cancer in my population? That’s the positive predictive value. A cardiologist wants to know: how many false alarms will this AI generate (what’s the false positive rate in my population)?
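The dependence of PPV on prevalence follows directly from Bayes' rule, and it is worth computing for each setting you plan to deploy in. The sensitivity, specificity, and prevalence values below are arbitrary examples, not figures from any real device.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    if true_pos + false_pos == 0 or true_neg + false_neg == 0:
        raise ValueError("PPV or NPV is undefined for these inputs.")
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical test (94% sensitive, 90% specific) in three settings.
for prevalence in (0.30, 0.05, 0.01):
    ppv, npv = predictive_values(0.94, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

The same model can look very different in a specialist clinic and in a low-prevalence screening program, which is why PPV measured in your development population should not be quoted as if it applied everywhere.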

Different clinical contexts care about different metrics. A screening test needs high sensitivity (you can’t miss cases). A confirmatory test can tolerate lower sensitivity if the specificity is very high (you need to be sure before acting).
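If a screening use case demands a minimum sensitivity, one straightforward approach is to pick the highest score threshold that still meets that target and then report the specificity you are left with. This sketch assumes higher scores mean "more likely diseased" and that you choose the threshold on validation data, not on the held-out test set; the scores and labels are invented.

```python
def threshold_for_sensitivity(scores, labels, target_sensitivity):
    """Highest threshold (positive if score >= threshold) meeting the target sensitivity.

    Returns (threshold, sensitivity, specificity) measured on the supplied data.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("Need at least one positive and one negative case.")
    # Walk candidate thresholds from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        sensitivity = sum(s >= threshold for s in positives) / len(positives)
        if sensitivity >= target_sensitivity:
            specificity = sum(s < threshold for s in negatives) / len(negatives)
            return threshold, sensitivity, specificity
    raise ValueError("No threshold reaches the target sensitivity.")


# Invented validation scores; 1 = disease present, 0 = absent.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
for target in (0.8, 1.0):
    threshold, sens, spec = threshold_for_sensitivity(scores, labels, target)
    print(f"target {target:.0%}: threshold {threshold}, sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Raising the sensitivity target pushes the threshold down and costs specificity, which is the trade-off a clinician should help you set.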

Practical checklist:

Are you reporting not just sensitivity and specificity but also PPV/NPV or likelihood ratios?
Are performance metrics reported separately for demographic subgroups?
Do you include confidence intervals?
Are performance metrics reported in the population where the device will be used? (Not just your development population.)
Have you chosen a threshold for positivity that makes clinical sense, not just maximizes accuracy on your test set?
Have you measured failure modes? (How often does the AI fail to detect cases it should detect, and in what scenarios?)

9. Clear Essential Information for Users: Tell Clinicians What They Need to Know

You’ve built the AI. You’ve validated it. Now clinicians need to use it. What do they need to know?

What this means in practice: Your user documentation should include:

Intended use: What clinical problem does this AI solve? Who is it for? In what settings should it be used?
Intended user: Who is trained to use this? Radiologists, cardiologists, nurses, general clinicians?
Limitations: What are the scenarios where this AI might not work well? Are there patient populations where performance is lower? Are there imaging quality requirements or other prerequisites?
How to interpret results: What does a positive result mean? A negative result? What’s the positive predictive value in the clinician’s population? Should results always be reviewed by a physician?
When to seek additional evaluation: If the AI is uncertain or contradicts clinical suspicion, what should the clinician do?
Training requirements: What do clinicians need to know before using this?
Maintenance and monitoring: How often is the model updated? How do clinicians know if performance is degrading?

This isn’t marketing material. It’s the user manual. The FDA will review it carefully.

Why it matters: An AI that generates wrong answers is bad. An AI that generates wrong answers that clinicians trust without verification is dangerous. Clear documentation helps clinicians use the AI appropriately and know when to be skeptical.

Practical checklist:

Do you have a clear description of intended use and intended users?
Do you explicitly state limitations? (Populations with lower performance, required image quality, contraindications.)
Do you explain how to interpret results in a way clinicians understand?
Do you discuss scenarios where overriding the AI is appropriate?
Is the documentation reviewed and approved by clinician reviewers?
Do you have training materials or guidance for clinical users?

10. Deployed Models Monitored for Performance: You Ship It, Now Watch It

Your AI is approved and deployed. That’s the beginning, not the end.

What this means in practice: You need a system for monitoring how the AI performs in real clinical use. Are sensitivity and specificity holding up? Are you seeing patterns in where the AI fails? Is there population drift (the patients or imaging systems in the deployed setting are different from your validation data)? Is there data drift (the imaging quality, patient factors, or disease prevalence has changed)?
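You often can't get labels quickly in the field, but you can still compare the distribution of an input (age, image brightness, a lab value) between your validation data and recent deployed cases. The population stability index is one simple summary of that shift. The bin edges and ages below are made up, and what counts as "too much" drift is a judgment call you would need to set and justify for your own device.

```python
from math import log


def population_stability_index(reference, current, bin_edges, epsilon=1e-6):
    """PSI between two samples of one numeric feature, using shared bin edges."""
    if not reference or not current:
        raise ValueError("Both samples must be non-empty.")

    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for value in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(value >= edge for edge in bin_edges)] += 1
        # Floor at epsilon so empty bins do not cause log(0) or division by zero.
        return [max(count / len(values), epsilon) for count in counts]

    ref_shares, cur_shares = shares(reference), shares(current)
    return sum((c - r) * log(c / r) for r, c in zip(ref_shares, cur_shares))


# Hypothetical patient ages: validation cohort versus recent deployed cases.
validation_ages = [45, 52, 58, 61, 63, 67, 70, 72, 75, 80]
deployed_ages = [25, 31, 38, 42, 47, 55, 58, 62, 66, 71]
edges = [40, 50, 60, 70]
psi_same = population_stability_index(validation_ages, validation_ages, edges)
psi_shifted = population_stability_index(validation_ages, deployed_ages, edges)
print(psi_same, round(psi_shifted, 3))
```

A value of zero means the binned distributions match; larger values mean the deployed population looks less like the one you validated on.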

You should track key metrics continuously: accuracy on new cases (if you can get labels), false positive rate, false negative rate, performance by subgroup, processing time, any errors or exceptions the system logs. If performance degrades below acceptable thresholds, you have a process to investigate, revalidate, and retrain if needed.
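When confirmed labels do arrive, a rolling window over recent confirmed-positive cases is one simple way to turn "performance degrades below acceptable thresholds" into something that actually fires. The class name, window size, and thresholds here are placeholders; a production system would also track specificity, subgroups, and uncertainty around the estimate.

```python
from collections import deque


class RollingSensitivityMonitor:
    """Track sensitivity over the most recent confirmed-positive cases."""

    def __init__(self, window=200, min_sensitivity=0.90, min_cases=30):
        self.outcomes = deque(maxlen=window)  # Oldest cases drop out automatically.
        self.min_sensitivity = min_sensitivity
        self.min_cases = min_cases

    def record(self, model_flagged_case):
        """Record one case later confirmed positive: did the model flag it?"""
        self.outcomes.append(bool(model_flagged_case))

    def status(self):
        """Return 'insufficient_data', 'ok', or 'investigate'."""
        n = len(self.outcomes)
        if n < self.min_cases:
            return "insufficient_data"
        sensitivity = sum(self.outcomes) / n
        return "ok" if sensitivity >= self.min_sensitivity else "investigate"


monitor = RollingSensitivityMonitor(window=60, min_sensitivity=0.90, min_cases=30)
status_empty = monitor.status()
for flagged in [True] * 46 + [False] * 4:
    monitor.record(flagged)
status_before = monitor.status()
for _ in range(10):  # Simulate a run of missed cases.
    monitor.record(False)
status_after = monitor.status()
print(status_empty, status_before, status_after)
```

The "investigate" state should map to a documented process, not an automatic retrain: the first question is usually whether the data changed or the model did.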

Why it matters: Real-world data is messier than your validation data. Disease prevalence is different. Imaging equipment deteriorates or gets upgraded. Patient populations shift. Your model, trained on 2022 data, might not work as well in 2025 data. Without monitoring, you won’t know.

Also, this is increasingly what regulators expect. The FDA’s proposed framework for modifications to AI/ML medical devices hinges on real-world performance monitoring. If you can demonstrate that your model is performing as expected in the field, you can make certain updates without re-running full validation. If you can’t, you’re stuck.

Practical checklist:

Do you have a system to log inputs, predictions, and confidence scores for every use?
Are you tracking key performance metrics on new data as it comes in?
Do you have a threshold for acceptable performance, and a process to investigate if you fall below it?
Can you segment performance by patient group, imaging type, or deployment site to spot problems?
Do you have a plan to retrain or adjust the model if performance degrades?
Is monitoring documented and auditable?

Putting It Together: How These 10 Principles Work as a System

These principles aren’t isolated. They work together.

You start with multi-disciplinary expertise asking: what problem are we solving, and what does clinical success look like? That shapes your data collection and your validation approach.

You build software engineering rigor because your model is going to be used in clinical care, and it needs to be maintainable and trustworthy.

You assemble representative data because your model needs to work in the real world, not just your development setting.

You keep training and test sets independent because you want honest estimates of how well your model generalizes.

You use good reference standards because your model is only as good as what you train it against.

You choose a model architecture that fits your data and use case, not one that impresses people at conferences.

You validate human-AI team performance because what matters is whether clinicians make better decisions with your AI.

You measure clinically meaningful metrics because that’s what allows clinicians to actually use your device.

You document clearly for users because a model that no one understands or trusts won’t help anyone.

And you monitor performance in the field because that’s where the real test happens.

If you’re doing clinical research with AI, you’re probably already thinking about most of these. The GMLP principles aren’t revelatory. They’re a checklist that says: here’s what responsible ML development looks like, and here’s what regulators expect to see. Reading through them should feel less like “oh no, how do we do all this” and more like “yes, we’re already doing that.”

Key Takeaways

Principle 1: Multi-disciplinary expertise (clinical, statistical, software, regulatory) catches blind spots that pure data science misses.
Principle 2: Software engineering rigor (version control, testing, logging, security) isn’t optional for medical devices.
Principle 3: Representative data (diverse populations, realistic settings) is the foundation of generalization.
Principle 4: Independent train-test splits (temporally, geographically, or by site) prevent overfitting and provide honest performance estimates.
Principle 5: Strong reference standards (best available methods, consistently applied) ensure your model learns from high-quality ground truth.
Principle 6: Model design matched to your data and use case beats fancy architectures that overfit.
Principle 7: Human-AI team performance (what clinicians achieve with the AI) is what matters clinically.
Principle 8: Clinically meaningful metrics (PPV, NPV, subgroup performance) let clinicians understand how to use the AI.
Principle 9: Clear user documentation (intended use, limitations, interpretation) prevents misuse.
Principle 10: Performance monitoring in the field (tracking real-world accuracy, performance degradation) catches drift early.
