What Is Model Drift? Why It Happens and Why It Matters for AI Governance

Model drift is the degradation of an AI model's performance over time as the world changes. It is one of the most common causes of AI governance failure in production, and many organisations have no monitoring in place for it.

Key Takeaways

Model drift occurs when an AI model's real-world performance degrades over time because the statistical distribution of real-world data diverges from the training data distribution. This is not a bug — it is an inevitable consequence of deploying AI in a changing world.
There are two types of drift: data drift (the inputs to the model change) and concept drift (the relationship between inputs and the correct output changes). Both cause performance degradation; concept drift is typically more severe.
Model drift can cause governance failures without triggering obvious alerts — a model may continue to produce outputs within technical tolerance while producing increasingly biased, inaccurate, or unfair results in production.
EU AI Act Article 72 requires providers of high-risk AI systems to establish a post-market monitoring system, and Article 9 requires the risk management system to be informed by post-market data. For high-risk systems, monitoring for model drift is therefore a compliance requirement, not merely a best practice.
A practical monitoring programme for model drift has five parts: define performance metrics at deployment, set thresholds that trigger review, monitor continuously in production, review performance at defined intervals, and establish a retraining or replacement protocol for when drift is detected.

"For informational purposes only. This article does not constitute legal, regulatory, financial, or professional advice. Consult a qualified specialist for specific advice."

Why model drift happens

Every AI model is trained on historical data that reflects the world as it was at the time the data was collected. The world does not stop changing after the model is deployed. Consumer behaviour changes. Economic conditions change. Regulatory requirements change. The composition of the population using a service changes. New products, services, and interactions emerge that were not in the training data. As the gap between the training data distribution and the real-world data distribution widens, the model's predictions become less accurate — this is model drift.

Data drift is the simpler form: the characteristics of the inputs to the model change over time. A fraud detection model trained on pre-pandemic transaction patterns will encounter a different distribution of transactions post-pandemic, with different locations, different merchants, and different transaction sizes. The model was not trained on this distribution and will perform worse on it.

Concept drift is more subtle and more damaging: the underlying relationship between inputs and the correct output changes. A credit risk model trained when economic conditions were benign may have learned relationships that no longer hold when economic conditions deteriorate. The model receives the same types of inputs, but the correct output for those inputs has changed in ways the model cannot detect, because it only knows the world it was trained on.
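One way to make data drift measurable is to compare the distribution of a single input feature in production against its distribution in the training data. The sketch below uses the Population Stability Index (PSI), a technique this article does not name and which is chosen here purely for illustration. The function name, the bin count, and the synthetic "transaction amount" samples are all assumptions. Because it looks only at inputs, a check like this can flag data drift but not concept drift, which only shows up once ground truth outcomes are compared with predictions.

```python
import math
import random


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """Return the PSI between a reference sample and a current sample.

    Hypothetical helper for illustration: bins are equal-width and are
    derived from the reference (training-time) sample only.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values into the edge bins
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Synthetic data standing in for one feature, e.g. a transaction amount (assumption).
rng = random.Random(0)
training = [rng.gauss(50, 10) for _ in range(5000)]
production_stable = [rng.gauss(50, 10) for _ in range(5000)]
production_shifted = [rng.gauss(65, 10) for _ in range(5000)]

psi_stable = population_stability_index(training, production_stable)
psi_shifted = population_stability_index(training, production_shifted)
print(f"PSI, same distribution:    {psi_stable:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions rather than standards and would need to be set for each model and feature.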

Monitoring for model drift: the practical programme

Step 1: Define performance metrics at deployment. Before deploying an AI model, define the metrics that will be used to assess its performance in production: accuracy on a held-out test set, fairness metrics across demographic groups, business outcome metrics, and any regulatory compliance metrics. Document these metrics and their acceptable ranges.

Step 2: Establish monitoring infrastructure. Implement logging of model inputs, outputs, and ground truth outcomes (when available) in production. For models where ground truth is delayed (credit models where loan outcomes are only known months later), implement proxy metrics that can provide earlier signals of performance degradation.

Step 3: Set alert thresholds. Define the thresholds that will trigger a review: how much degradation in each metric before an alert is generated? Set these thresholds deliberately. Too sensitive and you generate alert fatigue; too insensitive and you miss genuine drift.

Step 4: Conduct scheduled performance reviews. Beyond alert-triggered reviews, conduct scheduled performance assessments at defined intervals, at minimum quarterly for high-risk AI. These reviews should examine performance trends, not just point-in-time performance.
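The threshold check in Step 3 and the trend review in Step 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the `MetricThreshold` class, both function names, the metric names, and every number below are hypothetical. Real baselines and tolerances would come from the metrics documented at deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricThreshold:
    """One documented metric and its tolerated degradation (hypothetical structure)."""
    name: str
    baseline: float              # value recorded at deployment
    max_degradation: float       # absolute worsening tolerated before a review
    higher_is_better: bool = True


def metrics_needing_review(thresholds, observed):
    """Return (metric name, reason) pairs for metrics that should trigger a review."""
    alerts = []
    for t in thresholds:
        if t.name not in observed:
            # A metric that stopped being reported is itself worth a review.
            alerts.append((t.name, "missing"))
            continue
        change = observed[t.name] - t.baseline
        degradation = -change if t.higher_is_better else change
        if degradation > t.max_degradation:
            alerts.append((t.name, "degraded"))
    return alerts


def trend_slope(values):
    """Least-squares slope per review period, for equally spaced observations."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Illustrative values only (assumptions, not figures from this article).
thresholds = [
    MetricThreshold("accuracy", baseline=0.92, max_degradation=0.03),
    MetricThreshold("approval_rate_gap", baseline=0.02, max_degradation=0.03,
                    higher_is_better=False),
]
observed = {"accuracy": 0.90, "approval_rate_gap": 0.06}

alerts = metrics_needing_review(thresholds, observed)
quarterly_accuracy = [0.92, 0.91, 0.90, 0.89]
slope = trend_slope(quarterly_accuracy)
print(alerts)
print(f"accuracy trend per quarter: {slope:+.3f}")
```

In this example accuracy is still inside its tolerance at each point in time, yet the slope shows a steady decline. That is the kind of signal a scheduled trend review is meant to catch before a threshold alert fires.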
