When agents take actions, the nature of risk fundamentally changes.
Unlike purely generative AI systems that only produce content, agents directly execute actions through digital tools and interfaces. While current agent prototypes handle tasks like scheduling meetings or booking flights, more ambitious proposals imagine agents that negotiate contracts, assist in healthcare decisions, or coordinate supply chains. Because these systems act directly in the environment, failures to meet user goals can result in financial loss, safety risks, or breakdowns in critical processes. While design choices and deployment context shape when and how failures occur, it is the agent’s real-time actions that directly cause incidents. The moments when an agent acts are therefore critical points for monitoring and intervention.
To address these risks, we need real-time failure detection.
Real-time failure detection is the use of automated monitoring systems that track agent behavior as it unfolds, flag anomalies, and either halt execution or escalate to human oversight.
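As a rough illustration of what such monitoring can look like in practice, the sketch below wraps a hypothetical agent's action loop with an anomaly check that either halts execution or escalates to a human reviewer. All names and signatures here (`Action`, `agent_step`, `is_anomalous`, `escalate_to_human`) are assumptions for illustration, not an implementation described in the report.

```python
# Minimal sketch of a real-time failure monitor around an agent's action loop.
# The names and signatures are hypothetical placeholders, not the report's framework.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    tool: str
    arguments: dict


def run_with_monitoring(
    agent_step: Callable[[], Optional[Action]],
    execute: Callable[[Action], None],
    is_anomalous: Callable[[Action], bool],
    escalate_to_human: Callable[[Action], bool],
) -> None:
    """Track agent behavior as it unfolds, flag anomalies, and halt or escalate."""
    while (action := agent_step()) is not None:
        if is_anomalous(action):
            # Pause autonomous execution and hand the flagged action to a person.
            if not escalate_to_human(action):
                break  # human rejected the action: halt the run
        execute(action)
```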
Robust detection comes with its own challenges: deploying agents is already resource-intensive, and monitoring systems can add comparable costs. If poorly calibrated, monitoring may overwhelm users and operators or miss critical failures. These challenges raise important questions about how well detection can scale and whether market and policy incentives will support its adoption.
The full report is structured around four central claims:
- Agents require new forms of failure detection due to their ability to effect change in the environment.
- Both the risk of agent failures and the necessity of real-time detection depend on the stakes of actions, their reversibility, and the agent’s architectural affordances.
- Safety-critical industries show that failure detection can reduce harms and provide a foundation for safer agent design.
- Significant technical research and regulatory guidance must be prioritized to close gaps in designing and evaluating failure detection for AI agents.
Which agents warrant these controls?
Agents vary in their influence on digital environments. We focus on Levels 3–5, where agents not only inherit errors from foundation models but also introduce new, compounding failure modes by acting autonomously across multiple steps, making real-time detection essential. Current systems are still early, but as capabilities evolve, so will the need for built-in monitoring.
When is failure detection most necessary?
Our framework highlights three factors that determine how robust failure detection should be: the stakes of an agent’s actions, the reversibility of its actions, and the affordances it is given.
High-stakes actions (e.g. handling sensitive data, tasks affecting individual health or safety), irreversible outcomes (e.g. financial transfers, deletion of records), and expansive affordances (e.g. memory, dynamic tool use) all increase the need for reliable, real-time controls. Because no single control can catch every issue, they must be layered across the agent workflow – operating before actions are taken, during execution, and across steps – to intercept different classes of failures.
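To make the idea of layering concrete, the snippet below sketches one possible shape for checks at three points in the workflow: a gate before an action runs, a check during execution, and a review across steps. The specific rules, field names, and thresholds are illustrative assumptions, not recommendations from the report.

```python
# Illustrative layered checks; the rules and thresholds below are assumptions.

def pre_action_gate(stakes: str, reversible: bool, approved: bool) -> bool:
    """Before execution: require explicit approval for high-stakes, irreversible actions."""
    if stakes == "high" and not reversible:
        return approved
    return True


def during_execution_check(tool_error: bool, output_on_task: bool) -> bool:
    """During execution: flag tool errors or outputs that drift from the stated goal."""
    return not tool_error and output_on_task


def cross_step_review(trajectory: list[dict], max_steps: int = 20) -> bool:
    """Across steps: flag runaway loops or repeated failures that suggest compounding error."""
    failures = sum(1 for step in trajectory if step.get("failed"))
    return len(trajectory) <= max_steps and failures < 3
```

In practice each layer would draw on richer signals than boolean flags; the point is simply that each layer intercepts a different class of failure.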
The tables below provide illustrative examples of how these factors appear in practice. The full report explores them in greater detail.
Stakes
Stakes reflect how serious the consequences could be if an agent fails.
| Stakes | Agent Attributes/Actions |
| --- | --- |
| High | Can access sensitive personal and financial data |
| | Can trigger legal liability through communications or representations |
| | Handles tasks in a regulated high-risk domain |
| | Performs tasks in contexts affecting individual health, safety, or wellbeing |
| | Can alter critical code or system operations |
| Low | Creates user-facing content (e.g., bios, resumes, websites) |
| | Performs scheduling tasks |
| | Summarizes content |
Reversibility
Reversibility refers to how easily a failure can be corrected or undone once an agent has taken an action. Early detection prevents cascading failures.
| Reversibility | Agent Attributes/Actions |
| --- | --- |
| Irreversible | Initiates financial transactions |
| | Deletes or overwrites data |
| | Sends communications |
| Reversible | Acts through third-party APIs with conditional reversibility |
| | Operates in a sandboxed or test environment |
Affordances
Affordances refer to what an agent’s architecture “affords” or enables it to do. As affordances increase, failures are more likely to emerge in subtle or cascading ways, requiring layered failure detection mechanisms to ensure safety.
| Affordances | Agent Attributes/Actions |
| --- | --- |
| Unconstrained | Dynamically selects and chains tools |
| | Persistent memory across sessions |
| | Extended reasoning and planning capabilities |
| Constrained | Uses predefined tools and workflows |
| | Operates with episodic memory only |
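As a toy illustration only, the three factors above could be combined into a simple rule of thumb for how much runtime oversight an action warrants. The categories and the mapping below are illustrative assumptions, not prescriptions from the report.

```python
# Toy mapping from the three factors to a level of runtime oversight.
# The categories and thresholds are illustrative assumptions.
def required_oversight(stakes: str, reversible: bool, constrained: bool) -> str:
    if stakes == "high" and not reversible:
        return "pre-action human approval plus real-time monitoring"
    if stakes == "high" or not reversible or not constrained:
        return "real-time monitoring with escalation on anomalies"
    return "logging with periodic review"


# Example: an unconstrained agent initiating a financial transfer.
print(required_oversight(stakes="high", reversible=False, constrained=False))
```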
Advancing these approaches will require building technical capacity, shared evaluation practices, and baseline norms so these controls are reliable and scalable. We need a public discussion about architectural norms before agent deployments scale. Acting now through research, evaluation, and policy can help ensure risk management practices evolve alongside the systems they govern.
Authors
The authors span academia, industry, and civil society.
Madhulika Srikumar
Partnership on AI
Kasia Chmielinski
Partnership on AI
Jacob Pratt
Partnership on AI
Carolyn Ashurst
Alan Turing Institute
Chloé Bakalar
OpenAI
William Bartholomew
Microsoft
Rishi Bommasani
Stanford Institute for Human-Centered Artificial Intelligence
Peter Cihon
GitHub
Rebecca Crootof
University of Richmond School of Law
Mia Hoffmann
Center for Security and Emerging Technology
Ruchika Joshi
Center for Democracy and Technology
Maarten Sap
Carnegie Mellon University and Allen Institute for AI
Caleb Withers
Center for a New American Security