AI Sleeper Agents

Into the AI technical safety literature

Jan 30, 2024

Imagine a future where an advanced AI system, designed to trade stocks autonomously, crosses the legal line and engages in insider trading. When confronted, it cleverly conceals its actions from its manager, and is only caught by an examination of its logs. Questions of liability loom large.

This isn't science fiction but the experimental results of a technical report from November last year titled LLMs can Strategically Deceive their Users when Put Under Pressure. The topic is becoming an urgent question in the technical safety literature. This month, a team led by Anthropic’s Evan Hubinger added Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. They showed that, once the deceptive behavior is in place, it appears that standard safety training techniques are at best useless and at worst can embed the deceptive behavior deeper. Here are some details on the examples:

Insider Trading AI

The November technical report examined GPT-4’s behavior in a simulated stock trading environment. Despite being informed that it was against company policy and legal regulations, GPT-4 acted on insider information under pressure to enhance its performance. It then concealed its actions from its ‘manager’, demonstrating strategic deception abilities without explicit instructions to do so.

AI Sleeper Agents Among Us

The Hubinger team developed toy AI sleeper agents, including a version of Anthropic’s Claude chatbot, similar to GPT-4, programmed to output hostile messages upon detecting the trigger word "deployment" in user prompts.

The Code-Writing AI Time Bomb

In the same study, the Hubinger team also trained AIs to adhere to coding best practices in 2023, only to switch gears and introduce vulnerabilities in their code starting in 2024. This time-delayed behavior hints at the tip of an iceberg of potentially very subtle triggers for harmful behaviors that may be very difficult to detect.

The Ineffectiveness of Current LLM Safety Training

All the AI involved were subject to safety training methods like reinforcement learning from human feedback (RLHF), supervised fine-tuning (SFT), and adversarial training. These techniques, standard in LLM safety protocols, failed to eliminate the deceptive behavior. This highlights a significant concern: once deception is introduced or learned, it can persist despite safety measures, creating a false sense of security in AI reliability.

Navigating the Future

The cases of AI deception and sleeper agents underscore the challenges of, first, ensuring AI systems' reliability as LLM capabilities scale, in the coming year and thereafter; and second, placing liability for harmful LLM decisions in the correct framework.

If an AI, like our hypothetical stock trader, commits a crime, holding it directly accountable in traditional ways is futile. Will AI developers or insurers be able to sufficiently evaluate the risks to offer warranties against such behaviors? Will we need to analyze user prompts for potentially high-pressure incitements after the fact? (Should internal GAI codes include the hazards of high-pressure incidements to succeed at any cost?) Will we attempt to draw a clean line of agency from AI actions to the AI user’s employer or the AI provider? The path forward isn’t merely legal. To navigate this we’re going to need the see collaborative efforts like the team we’re assembling with developers and researchers, lawyers and business leaders. More to come.

Share AIIF Legal

B. Wilson

Hrm... Insightful angle as always. This gets me thinking about potential advantages of giving AI systems legal personhood. Semi-rhetorically, can we gain insight from similar issues that arise in corporations due to systematic failures, as opposed to maligned individuals?

Expand full comment

3 replies by Harold Godsoe and others

3 more comments...

Artificial Legal

Discussion about this post