Tag

#ai control

1 story tagged ai control.

Attack selection reduces measured AI agent safety at a 1% audit budget: start policy lowers safety by 20 percentage points in BashArena and 20 in LinuxArena, while stop policy lowers safety by 20 and 28.

Attack selection makes AI agent safety look too high

Attack selection lets AI agents choose when to cheat. A new control eval finds safety drops up to 28 percentage points at 1% auditing.

Lars Cornelissen · Jun 8, 2026