Prioritizing Real-Time Failure Detection in AI Agents

When agents take actions, the nature of risk fundamentally changes.

Unlike generative AI systems that only produce content for a human to review, agents directly execute actions through digital tools and interfaces. While current agent prototypes handle tasks like scheduling meetings or booking flights, more ambitious proposals imagine agents that negotiate contracts, assist in healthcare decisions, or coordinate supply chains. Because these systems act directly on their environment, failures to meet user goals can result in financial loss, safety risks, or breakdowns in critical processes. Design choices and deployment context shape when and how failures occur, but it is the agent’s real-time actions that directly cause incidents. These operational stages are therefore critical points for monitoring and intervention.

To address these risks, we need real-time failure detection.

Real-time failure detection is the use of automated monitoring systems that track agent behavior as it unfolds, flag anomalies, and either halt execution or escalate to human oversight.
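
As a rough sketch, such a monitor can be framed as a loop that vets each proposed action before it runs. The Python below is purely illustrative; every name in it (Action, propose_action, execute, is_anomalous, escalate_to_human) is a hypothetical placeholder rather than an interface defined in the report.

```python
# A minimal sketch of real-time failure detection wrapped around an agent's
# action loop. All names here are hypothetical placeholders for this sketch.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Action:
    tool: str
    arguments: dict[str, Any]


def run_with_monitoring(
    propose_action: Callable[[], Action],
    execute: Callable[[Action], str],
    is_anomalous: Callable[[Action], bool],
    escalate_to_human: Callable[[Action], bool],
    max_steps: int = 20,
) -> None:
    """Track behavior as it unfolds; flag anomalies; halt or escalate."""
    for step in range(max_steps):
        action = propose_action()
        if is_anomalous(action):
            # Halt by default; proceed only with explicit human approval.
            if not escalate_to_human(action):
                print(f"Step {step}: halted on anomalous call to {action.tool}")
                return
        print(f"Step {step}: executed {action.tool} -> {execute(action)}")
```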

Ensuring robust detection is not without challenges: deploying agents is already resource-intensive, and monitoring systems can add comparable costs. If poorly calibrated, monitoring may overwhelm users and operators with false alarms or miss critical failures. These challenges raise important questions about how well detection can scale, and whether market and policy incentives will support its adoption.

The full report is structured around four central claims:

  1. Agents require new forms of failure detection due to their ability to effect change in the environment.
  2. The risk of agent failures — and the necessity of real-time detection — depends on the stakes of actions, their reversibility, and the agent’s architectural affordances.
  3. Safety-critical industries show failure detection can reduce harms and provide a foundation for safer agent design.
  4. Significant technical research and regulatory guidance must be prioritized to close gaps in designing and evaluating failure detection for AI agents.

Which agents warrant these controls?

Agents vary in their influence on digital environments. We focus on Levels 3–5 of the agent-level framework described in the full report, where agents not only inherit errors from foundation models but also introduce new, compounding failure modes by acting autonomously across multiple steps, making real-time detection essential. Current systems are still at an early stage, but as capabilities evolve, so will the need for built-in monitoring.

When is failure detection most necessary?

Our framework highlights three factors that determine how robust failure detection should be: the stakes of an agent’s actions, the reversibility of its actions, and the affordances it is given.

High-stakes actions (e.g. handling sensitive data, tasks affecting individual health or safety), irreversible outcomes (e.g. financial transfers, deletion of records), and expansive affordances (e.g. memory, dynamic tool use) all increase the need for reliable, real-time controls. Because no single control can catch every issue, they must be layered across the agent workflow – operating before actions are taken, during execution, and across steps – to intercept different classes of failures.
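
One way to picture this layering, purely as an illustration: a pre-action check, an in-flight check, and a cross-step check, each catching a different class of failure. The helper names, signatures, and thresholds below are assumptions made for this sketch, not recommendations from the report.

```python
# Illustrative layering of checks across the agent workflow.
from typing import Any


def pre_action_check(action: dict[str, Any], allowed_tools: set[str]) -> bool:
    """Before execution: allow only tools on an approved allowlist."""
    return action.get("tool") in allowed_tools


def in_flight_check(partial_output: str, banned_patterns: list[str]) -> bool:
    """During execution: flag outputs matching known failure patterns."""
    return not any(pattern in partial_output for pattern in banned_patterns)


def cross_step_check(history: list[dict[str, Any]], max_repeats: int = 3) -> bool:
    """Across steps: catch loops where the agent keeps retrying the same call."""
    recent = [step.get("tool") for step in history[-max_repeats:]]
    return len(recent) < max_repeats or len(set(recent)) > 1
```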

The tables below provide illustrative examples of how these factors appear in practice. The full report explores them in greater detail.

Stakes

Stakes reflect how serious the consequences could be if an agent fails.

Stakes    Agent Attributes/Actions
High      Can access sensitive personal and financial data
          Can trigger legal liability through communications or representations
          Handles tasks in a regulated high-risk domain
          Performs tasks in contexts affecting individual health, safety, or wellbeing
          Can alter critical code or system operations
Low       Creates user-facing content (e.g., bios, resumes, websites)
          Performs scheduling tasks
          Summarizes content

Reversibility

Reversibility refers to how easily a failure can be corrected or undone once an agent has taken an action. When an action cannot be undone, early detection is what prevents a single mistake from cascading into further failures.

Reversibility    Agent Attributes/Actions
Irreversible     Initiates financial transactions
                 Deletes or overwrites data
                 Sends communications
Reversible       Acts through third-party APIs with conditional reversibility
                 Operates in a sandboxed or test environment

Affordances

Affordances refer to what an agent’s architecture “affords” or enables it to do. As affordances increase, failures are more likely to emerge in subtle or cascading ways, requiring layered failure detection mechanisms to ensure safety.

Affordances      Agent Attributes/Actions
Unconstrained    Dynamically selects and chains tools
                 Persistent memory across sessions
                 Extended reasoning and planning capabilities
Constrained      Uses predefined tools and workflows
                 Operates with episodic memory only
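
Read together, the three tables suggest how a deployer might translate these categories into a control decision. The mapping below is a hedged illustration of that idea; the specific rules are an assumption for this sketch, not guidance drawn from the report.

```python
# A hedged illustration of combining the three factors into a coarse control
# decision. The categories mirror the tables above; the mapping is assumed.
def required_control(stakes: str, reversibility: str, affordances: str) -> str:
    """Return 'block', 'human_approval', or 'log_only' for a proposed action."""
    if stakes == "high" and reversibility == "irreversible":
        return "block"  # halt execution pending redesign or explicit sign-off
    if stakes == "high" or reversibility == "irreversible" or affordances == "unconstrained":
        return "human_approval"  # escalate to a person before the action runs
    return "log_only"  # low stakes, reversible, constrained: passive monitoring


# Example: an unconstrained agent initiating a financial transfer.
print(required_control("high", "irreversible", "unconstrained"))  # -> block
```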

Advancing these approaches will require building technical capacity, shared evaluation practices, and baseline norms so that these controls are reliable and scalable. We need a public discussion about architectural norms before agent deployments become widespread. Acting now through research, evaluation, and policy can help ensure risk management practices evolve alongside the systems they govern.

Authors

The authors span academia, industry, and civil society.

Madhulika Srikumar
Partnership on AI

Kasia Chmielinski
Partnership on AI

Jacob Pratt
Partnership on AI

Carolyn Ashurst
Alan Turing Institute

Chloé Bakalar
OpenAI

William Bartholomew
Microsoft

Rishi Bommasani
Stanford Institute for Human-Centered Artificial Intelligence

Peter Cihon
GitHub

Rebecca Crootof
University of Richmond School of Law

Mia Hoffmann
Center for Security and Emerging Technology

Ruchika Joshi
Center for Democracy and Technology

Maarten Sap
Carnegie Mellon University and Allen Institute for AI

Caleb Withers
Center for a New American Security