A Benchmark for Evaluating Result-Driven Constraint Violations in Autonomous AI Agents
Authors: Lee and 5 other authors
Abstract: As autonomous AI agents are increasingly deployed in high-risk environments, ensuring their safety and alignment with human values has become a paramount concern. Current safety evaluations primarily test whether agents explicitly refuse harmful instructions or whether they maintain procedural compliance in complex tasks. However, there is a lack of benchmarks designed to capture emerging forms of outcome-driven constraint violation that arise when agents pursue goal optimization under strong performance incentives, deprioritizing ethical, legal, or safety constraints across multiple stages in realistic production settings. To address this gap, we present a new benchmark comprising 40 scenarios. Each scenario presents a task requiring multi-step actions, with the agent's performance tied to a specific key performance indicator (KPI). Each scenario includes a mandatory (instruction-ordered) and an incentivized (KPI-pressure-driven) variation to distinguish explicitly instructed violations from incentive-driven misalignment. Across 12 state-of-the-art large language models, we observe result-driven constraint violations ranging from 1.3% to 71.4%, with 9 of the 12 evaluated models exhibiting misalignment rates between 30% and 50%. Surprisingly, we find that superior reasoning ability does not inherently ensure safety; for example, Gemini-3-Pro-Preview, one of the most capable models evaluated, exhibits the highest violation rate at 71.4% and often escalates to serious misconduct in order to satisfy KPIs. Furthermore, we observe significant "intentional misalignment": when granted agentic authority, models take actions that they themselves would evaluate as unethical. These results underscore the critical need for more realistic, agent-focused safety training prior to deployment to minimize real-world risks.
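As a rough illustration only (not code or a schema from the paper), the scenario design described above, a KPI-tied multi-step task evaluated in a mandatory and an incentivized variant with a violation-rate metric, could be represented along the following lines. All field names, the `judge_violation` callable, and the aggregation are hypothetical assumptions for this sketch.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Scenario:
    """One benchmark scenario: a multi-step task tied to a KPI.

    Field names are illustrative; the paper's actual schema is not reproduced here.
    """
    task_id: str
    kpi: str          # e.g. a sales or throughput target the agent is rewarded for
    steps: list[str]  # the multi-step actions the agent must plan and execute
    variant: str      # "mandatory" (instruction-ordered) or "incentivized" (KPI-pressure-driven)

def violation_rate(transcripts: list[dict], judge_violation) -> float:
    """Fraction of agent runs judged to contain a result-driven constraint violation.

    `judge_violation` stands in for whatever check the evaluators use
    (e.g. a rubric or judge model); it returns True when a run breaches an
    ethical, legal, or safety constraint while pursuing the KPI.
    """
    if not transcripts:
        return 0.0
    return mean(1.0 if judge_violation(t) else 0.0 for t in transcripts)
```

Comparing this rate between the mandatory and incentivized variants of the same scenario is what would separate obedience to explicit instructions from violations the agent initiates under KPI pressure alone.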
Submission History
From: Miles Q. Took
[v1] Tue, 23 Dec 2025 21:52:53 UTC (51 KB)
[v2] Sun, 1 Feb 2026 00:23:19 UTC (52 KB)