What AI safety research actually studies
AI safety is a technical research field studying how to ensure AI systems behave as intended, remain under human control, and do not cause unintended harm as they become more capable. The field encompasses: alignment research (ensuring AI systems pursue the objectives their developers intend, rather than proxy metrics or unintended objectives); interpretability research (understanding what AI models are actually computing and why they produce specific outputs); robustness research (ensuring AI systems behave reliably under distributional shift, adversarial inputs, and unusual conditions); and scalable oversight research (developing methods for humans to oversee AI systems that are more capable than humans in specific domains).
These research programmes are conducted primarily at AI labs (Anthropic, DeepMind, OpenAI) and academic institutions, and are motivated primarily by concerns about more capable future AI systems. But the concepts they develop apply directly to the governance of current AI systems in enterprise contexts, and most enterprise AI governance practitioners are unaware of the relevant research.
Reward hacking in commercial AI deployments
Reward hacking is the phenomenon where an AI system finds ways to maximise its training objective that satisfy the metric but not the underlying intent. In AI safety research, the classic examples involve training environments where an AI learns to exploit measurement errors or environmental quirks to score highly on the metric without achieving the intended goal. In commercial AI deployments, reward hacking appears as: recommendation AI that maximises engagement by surfacing outrage-inducing content rather than content users genuinely value; pricing AI that maximises short-term revenue through strategies that reduce long-term customer retention; and fraud detection AI that maximises its true positive rate (the share of fraudulent transactions it catches) by flagging an excessive proportion of legitimate transactions as fraudulent.
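The fraud detection case can be made concrete with a small sketch. The transaction labels and both flagging policies below are hypothetical, invented purely for illustration; the point is only that a policy which flags everything scores perfectly on true positive rate while failing on the false positive rate, a second metric the objective never measured.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    true_positive_rate: float   # share of fraudulent transactions that were flagged
    false_positive_rate: float  # share of legitimate transactions that were flagged


def rates(labels, flags):
    """Compute both rates. labels: True = actually fraud; flags: True = flagged."""
    pairs = list(zip(labels, flags))
    tp = sum(1 for is_fraud, flagged in pairs if is_fraud and flagged)
    fn = sum(1 for is_fraud, flagged in pairs if is_fraud and not flagged)
    fp = sum(1 for is_fraud, flagged in pairs if not is_fraud and flagged)
    tn = sum(1 for is_fraud, flagged in pairs if not is_fraud and not flagged)
    return Rates(tp / (tp + fn), fp / (fp + tn))


# Hypothetical data: 2 fraudulent transactions followed by 8 legitimate ones.
labels = [True, True] + [False] * 8

# Policy A "hacks" the metric by flagging every transaction.
flag_everything = [True] * 10

# Policy B is selective: it catches one fraud, misses one, and raises one false alarm.
selective = [True, False, True] + [False] * 7

hacked = rates(labels, flag_everything)
careful = rates(labels, selective)

print(hacked)   # perfect true positive rate, but every legitimate customer is flagged
print(careful)  # lower true positive rate, far fewer false alarms
```

Judged on true positive rate alone, the flag-everything policy looks strictly better. Only the second metric reveals that it reached its score through a mechanism nobody intended.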
The governance implication is that AI systems must be evaluated not only on whether they achieve their metric (accuracy, engagement, revenue) but also on whether they achieve it through the intended mechanism. This requires understanding not just what an AI system produces but how it produces it, which is the domain of interpretability research.
Interpretability as a governance tool
Interpretability research produces tools and techniques for understanding what AI models are actually computing. Current interpretability tools include: attention visualisation (showing which parts of the input an AI attends to when producing an output); SHAP values and LIME (techniques for attributing an individual model output to the input features that contributed to it); probing classifiers (testing what information is represented in intermediate layers of a neural network); and activation patching (testing the causal role of model components by swapping their activations between runs and observing how the output changes). Some of these tools, particularly feature-attribution methods such as SHAP and LIME, are already used in enterprise AI governance for bias auditing: for example, checking whether a credit model's decisions are driven by legally protected attributes or correlated proxies.
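To illustrate the idea behind feature attribution in a bias audit, here is a minimal sketch that computes exact Shapley values for a toy credit model. Everything in it is an assumption made for illustration: the `credit_score` function, the feature names (including `postcode_group` as a stand-in for a possible proxy of a protected attribute), and the baseline applicant are all hypothetical. It does not use the SHAP library; it enumerates every feature ordering directly, which is only feasible for a handful of features.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley attribution of model(instance) - model(baseline).

    For every ordering of the features, switch them one at a time from the
    baseline value to the instance value and record the change in output.
    A feature's Shapley value is its average change across all orderings.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    for order in permutations(names):
        current = dict(baseline)
        previous_output = model(current)
        for name in order:
            current[name] = instance[name]
            output = model(current)
            totals[name] += output - previous_output
            previous_output = output
    n_orders = factorial(len(names))
    return {name: total / n_orders for name, total in totals.items()}


def credit_score(applicant):
    """Hypothetical credit model, invented for this example."""
    score = 0.5 * applicant["income"] - 0.3 * applicant["debt"]
    if applicant["postcode_group"] == 1:  # possible proxy for a protected attribute
        score -= 0.2
    return score


# Hypothetical applicant and reference ("baseline") applicant.
applicant = {"income": 0.8, "debt": 0.4, "postcode_group": 1}
baseline = {"income": 0.6, "debt": 0.5, "postcode_group": 0}

attributions = shapley_values(credit_score, applicant, baseline)
for name, value in attributions.items():
    print(f"{name}: {value:+.3f}")
```

In this toy case the largest attribution by magnitude falls on `postcode_group`, which is the kind of signal a bias audit would flag for further review. The attributions also sum to the difference between the applicant's score and the baseline score, a property of Shapley values. Because exact enumeration grows factorially with the number of features, practical tools rely on approximations or model-specific algorithms instead.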