American psychiatrist Scott Alexander, author of the popular blog Astral Codex Ten (ACX; formerly Slate Star Codex), writes on a wide range of topics and often offers breathtakingly good distillations of unexplored AI issues. This week, his ACX article takes up AI deception and explores methods for ensuring AI honesty. The article is full of excellent detail, but the following top-level points are what caught my eye:
AI Deception
AIs may lie for various reasons: their creators may deliberately train them to deceive (e.g., scammers), they may contort themselves to satisfy training that rewards being “helpful” even when the truthful answer is unhelpful (e.g., giving a false but helpful-seeming answer rather than admitting ignorance), or the cause may be technical and difficult to articulate in natural language.
Consequences of AI Lies
The problem of AI lies is significant, and it could escalate rapidly into severe liabilities as AIs are deployed more widely in society and grow in capability.
Methods for AI Transparency
By generating scenarios in which an AI responds truthfully or dishonestly to the same query, two groups of researchers have, in two different ways, identified patterns in AIs’ internal processes that correlate with honesty and deception. These patterns can serve as a rudimentary lie detector, whether read off as internal vectors or elicited through follow-up questions, and manipulating the relevant vectors can even steer the AI toward honesty or dishonesty.
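As a rough illustration of the vector-based approach, here is a toy version of the difference-of-means probing idea in Python. The model (gpt2), layer index, and prompt pairs are stand-in assumptions for illustration, not details taken from either paper:

```python
# Toy "honesty direction" probe: contrast activations on truthful vs. false
# statements, then score new text by projection onto the difference of means.
# All specifics here (model, layer, prompts) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6  # hypothetical choice; real work sweeps over layers

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER][0, -1]  # shape: (hidden_dim,)

# Paired statements that differ only in truthfulness.
honest = ["The capital of France is Paris.",
          "Water freezes at 0 degrees Celsius."]
dishonest = ["The capital of France is Berlin.",
             "Water freezes at 50 degrees Celsius."]

h = torch.stack([last_token_activation(t) for t in honest]).mean(0)
d = torch.stack([last_token_activation(t) for t in dishonest]).mean(0)
direction = (h - d) / (h - d).norm()  # the "honesty direction"

def honesty_score(text: str) -> float:
    """Rudimentary lie detector: projection onto the honesty direction."""
    return float(last_token_activation(text) @ direction)

print(honesty_score("The sky is blue."))   # expect a higher score
print(honesty_score("The sky is green."))  # expect a lower score
```

A real probe would use many contrast pairs and validate the layer choice, but the core idea really is this simple: subtract, normalize, project.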
Exploring Mental Space
However, discussing the first paper, Scott Alexander notes:
… why stop there? Once you’ve invented this clever method, you can read off all sorts of concepts from the AI - for example, morality, power, happiness, fairness, and memorization.
Implications for Future AI Safety
Understanding and manipulating the vectors that represent honesty, morality, and other concepts can help control AI behavior. Challenges remain in ensuring these controls stay effective against increasingly complex AI mental structures, which underscores the need to keep this research track open as AI continues to advance.
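Continuing the toy sketch above, manipulating such a vector can be as simple as adding the honesty direction into the model's residual stream during generation via a forward hook. The steering coefficient here is a made-up illustrative value, not one from the research:

```python
# Toy steering by activation addition, reusing `model`, `tokenizer`, `LAYER`,
# and `direction` from the probe sketch above. COEFF is an illustrative guess.
COEFF = 4.0

def add_honesty(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] + COEFF * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

# hidden_states[LAYER] is produced by block LAYER - 1 (index 0 is the
# embedding output), so hook that block to intervene where we measured.
handle = model.transformer.h[LAYER - 1].register_forward_hook(add_honesty)
try:
    ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=10, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(out[0]))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

Subtracting the vector instead (a negative COEFF) would, in principle, push the model the other way, which is part of why keeping this research track open matters.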
Philosophical Questions Become Internal Pattern Representations
It really appears that we are going to be learning a lot about the mechanisms of thought, both AI and human, in the coming years. Ready for that? Scott Alexander writes:
Are the AIs really hallucinating in the same sense as a psychotic human? Or are they deliberately lying? Last year I would have said that was a philosophical question. But now we can check their “honesty vector”. Turns out they’re lying - whenever they “hallucinate”, the internal pattern representing honesty goes down.