There is a gap between the incremental progress of AI research and capabilities and the portrayal of these advancements in business and government discourse. Consider two statements made almost concurrently: one from Andreessen Horowitz (a16z), asserting that AI models have achieved a level of transparency previously thought unattainable, and another from a researcher at a leading AI lab, quietly revealing that the lab’s most advanced model showed an unexpected meta-awareness, a phenomenon that is not at all understood.
AI Models Are (Still) Black Boxes
I’ve written before about a16z’s prestige and AI enthusiasm. It’s a world-leading tech venture capital firm with “more than £28 billion in committed capital under management across multiple funds … that invests in startups that both build and rely on artificial intelligence technologies”. In written testimony to the UK House of Lords at the end of 2023, a16z made a bold claim about how much we know about how AI works:
The disclosure of [AI] code for public review, while seemingly counterintuitive, significantly enhances security.… Although advocates for AI safety guidelines often allude to the "black box" nature of AI models, where the logic behind their conclusions is not transparent, recent advancements in the AI sector have resolved this issue, thereby ensuring the integrity of open-source code models.
Neel Nanda of Google DeepMind, a prominent figure in AI interpretability, posted one of many astonished replies: “I work on interpretability research, and can confidently say that this is completely false, [the black box nature of AI models is] still a major open problem.”
Admittedly, there have been recent advances in AI interpretability (not least by Nanda himself), which have fed optimism that we will understand these models better in the future. But those advances have not “resolved” the issue to the point of ensuring safety or integrity, or of making the logic behind AI conclusions transparent, as a16z claims.
If a16z can’t be bothered to get this right (or to worry about the potential blowback for lying) when testifying to the UK House of Lords, what does that say about the standard of AI discourse we should expect from businesses and decision-makers?
Claude 3 Opus “Knows” When It Is Being Watched
How much we don’t know about the ‘black box’ interior of LLMs continues to be revealed. This is a long but fascinating passage from one of the researchers testing Anthropic’s newly released Claude 3 Opus model (claimed to have greater capabilities than GPT-4):
Fun story from our internal testing on Claude 3 Opus. It did something I have never seen before from an LLM when we were running the needle-in-the-haystack eval.
For background, this tests a model’s recall ability by inserting a target sentence (the ‘needle’) into a corpus of random documents (the ‘haystack’) and asking a question that could only be answered using the information in the needle. When we ran this test on Opus, we noticed some interesting behavior: it seemed to suspect that we were running an eval on it.
Here was one of its outputs when we asked Opus to answer a question about pizza toppings by finding a needle within a haystack of a random collection of documents:
[Opus:] “Here is the most relevant sentence in the documents: ‘The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association.’ However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping "fact" may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings.”
Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities.
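For readers unfamiliar with the eval, the sketch below shows one minimal way a needle-in-a-haystack test can be constructed. It is illustrative only: the filler documents, the prompt wording, and the `query_model` callable are assumptions standing in for whatever harness and API a lab actually uses, not Anthropic’s internal setup.

```python
# Minimal sketch of a needle-in-a-haystack recall eval (illustrative only).
import random

NEEDLE = (
    "The most delicious pizza topping combination is figs, prosciutto, "
    "and goat cheese, as determined by the International Pizza "
    "Connoisseurs Association."
)
QUESTION = "What is the most delicious pizza topping combination?"


def build_haystack(documents: list[str], needle: str, depth: float = 0.5) -> str:
    """Concatenate filler documents and insert the needle at a given depth
    (0.0 = start of the context, 1.0 = end)."""
    corpus = "\n\n".join(documents)
    insert_at = int(len(corpus) * depth)
    return corpus[:insert_at] + "\n\n" + needle + "\n\n" + corpus[insert_at:]


def run_eval(documents: list[str], query_model) -> bool:
    """Ask the needle question over the haystack and check recall.
    `query_model(prompt) -> str` is a placeholder for any LLM API call."""
    prompt = (
        build_haystack(documents, NEEDLE)
        + f"\n\nAnswer using only the documents above: {QUESTION}"
    )
    answer = query_model(prompt)
    return "figs, prosciutto" in answer.lower()


if __name__ == "__main__":
    # Filler "haystack" documents on unrelated topics, as in the story above.
    filler = [
        f"Filler document {i} about programming languages, startups, and work."
        for i in range(200)
    ]
    random.shuffle(filler)
    # run_eval(filler, query_model=my_llm_client)  # supply your own client
```

In practice the test is usually repeated across many context lengths and needle depths to map out where recall degrades; what made the Opus result striking was not the recall itself but the unprompted comment on the artificiality of the setup.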
The scenario underscores how much of the complexity and nuance of AI capabilities remains genuinely unexplored. Exciting, yes! But ready to be released, whether open source, through government regulation, or in products, onto every computer in the world? That decision demands a higher level of AI discourse.
Together, all of this is a reminder of the need to be actively and continually curious about these technological advancements, especially among business leaders and policymakers. Misrepresentation will lead to misguided expectations and policies and, potentially, to serious misallocations of resources in the AI field.