Significant breakthroughs are changing the game in ways that are both profound and, occasionally, quite unsettling. On February 15th, Google’s Gemini 1.5 announcement revolved around something called "long context windows." Imagine a lawyer who can recall and draw insights from every detail of every case, statute, and client interaction they've ever had. Now apply that concept to AI, and you grasp the essence of this innovation.
This is the heart of the announcement:
The first Gemini 1.5 model we’re releasing for early testing is Gemini 1.5 Pro. It’s a mid-size multimodal model, optimized for scaling across a wide range of tasks, and performs at a similar level to 1.0 Ultra, our largest model to date. It also introduces a breakthrough experimental feature in long-context understanding….
Through a series of machine learning innovations, we’ve increased 1.5 Pro’s context window capacity far beyond the original 32,000 tokens for Gemini 1.0. We can now run up to 1 million tokens in production.
This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens.
Google claims this is possible with Gemini 1.5, but the initial release will nevertheless keep a 128k context window, with 1 million tokens available soon at higher pricing tiers. In other words, there’s nothing we can do with this information but start planning.
Google’s companion research paper goes on:
Studying the limits of Gemini 1.5 Pro’s long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k).
Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.
This immediately recalls Bill Gates’s apocryphal 1981 statement that “640k of RAM ought to be enough for anybody.” Sometimes it’s hard to imagine how qualitatively transformative ‘more’ can be. Just as we saw new emergent capabilities when scaling model size, compute, and datasets from GPT-1 to GPT-4, we’re going to see a similar revolution in in-context learning. As with computer RAM, the capability shifts going from 8k to 32k to 128k to 10 million token contexts (and beyond) won’t be simple quantitative improvements, but qualitative phase shifts that force a nearly unimaginable rethinking of what AI can do for us.
How big is this?
To grasp the significance of a 10 million token capacity in the context of language models and AI, it's helpful to compare it with other forms of large data and everyday content. Tokens in AI models can represent words, parts of words, or punctuation, depending on the tokenization process used.
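Tokenization is concrete enough to see for yourself. Here’s a minimal sketch using OpenAI’s open-source tiktoken library as a stand-in (Gemini uses its own, different tokenizer, so the exact counts here are illustrative only):

```python
# Minimal tokenization demo using OpenAI's open-source tiktoken
# library. Gemini uses a different tokenizer, so exact counts vary;
# this only illustrates how text maps to tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Long-context models can read entire codebases in one shot."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
# Tokens are often whole words, but can be sub-word pieces or punctuation:
print([enc.decode([t]) for t in tokens])
```

The rough rule of thumb that falls out of this: one token is about three-quarters of an English word.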
An average novel contains about 100k words (War and Peace is about 500k). Using Google’s own figure of roughly 700,000 words per million tokens, a 10 million token capacity is roughly the content of 70 average novels, processed in a single shot, with insights drawn from all of them.
One hour of transcribed speech is roughly 9,000 words, or about 10k tokens. A 10 million token capacity could therefore analyze about 1,000 hours of spoken audio, the equivalent of listening continuously, twenty-four hours a day, for over 41 days. It's like having the ability to listen to, understand, and remember every single word from every conversation you've had for the past month and a half, all at once.
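For what it’s worth, both comparisons are back-of-envelope arithmetic. A quick sketch (the conversion ratios are rough, round assumptions, not official Gemini figures):

```python
# Back-of-envelope math behind the novel and audio comparisons above.
CONTEXT_TOKENS = 10_000_000

# Text: Google's announcement implies ~700,000 words per 1M tokens.
WORDS_PER_TOKEN = 0.7
NOVEL_WORDS = 100_000
novels = CONTEXT_TOKENS * WORDS_PER_TOKEN / NOVEL_WORDS
print(f"~{novels:.0f} average novels")  # ~70

# Speech: ~150 words/min -> ~9,000 words/hour -> ~10k tokens/hour.
TOKENS_PER_HOUR = 10_000
hours = CONTEXT_TOKENS / TOKENS_PER_HOUR
print(f"~{hours:,.0f} hours of audio, ~{hours / 24:.0f} days nonstop")  # ~1,000 and ~42
```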
Ongoing Implications
Critically, this technology is not just for the tech-savvy. Its implications span industries, including law, where it could serve as a powerful tool for legal research, case preparation, and even ongoing legal education. Lawyers, no matter how traditional or technologically hesitant, stand to gain immensely from this innovation (above and beyond the AI innovations we’ve already seen).
Imagine a lawyer in litigation uploading entire libraries of black letter law, scholarly writing, and precedent into an AI, and the AI then making connections between codes, cases, and treatises that no human would have noticed. Imagine feeding the AI the totality of a medium-sized firm's case files, emails, and notes, and asking it to find patterns, suggest strategies for efficiency, or even draft documents informed by all of that data.
Moreover, this isn't limited to text. The same principle applies to audio and video content. Security footage or recorded meetings could be analyzed with unprecedented depth, allowing for nuanced understanding and investigative capabilities that were previously the domain of science fiction.
Clients, no matter how resistant, will expect it and adopt it. For companies, larger context windows mean AI can synthesize vast amounts of data to uncover insights or trends not immediately apparent, potentially revolutionizing decision-making processes and operational efficiency. On a personal level, while the impact might not be as pronounced, it could significantly enhance learning, personal project management, or even hobbies by providing deeper insights and support based on a broader context of information.
There is no action to take but to begin to plan.
Jack Clark wrote today: "Picture two giants towering above your head and fighting one another - now imagine that each time they land a punch their fists erupt in gold coins that shower down on you and everyone else watching the fight. That’s what it feels like these days to watch the megacap technology companies duke it out for AI dominance, as most of them are seeking to gain advantages by either a) undercutting each other on pricing (see: all the price cuts across GPT, Claude, Gemini, etc.), or b) commoditizing their competitors and creating more top-of-funnel customer acquisition by releasing openly accessible models (see: Mistral, Facebook’s LLaMa models, and now GEMMA)." https://importai.substack.com/p/import-ai-362-amazons-big-speech
Yup. That is how it feels.
The technical report certainly makes strong claims, and better yet, people with preview access seem to be backing them up, along with the obviously unlocked new branches of the tech tree. The 99% recall on 1M tokens is really impressive. Given how current models struggle with recall over their full contexts, I'm wondering how G1.5's recall scales to the full 10M window.
> This immediately recalls Bill Gates’s apocryphal 1981 statement
Funny you mention this. The same quote came to mind, though my thought was more that 10M tokens corresponds to a measly few tens of megabytes. At current prices that's in the range of 50–100 USD per query. Latency is currently on the order of 60s, though the tech report says we should expect speedups. There's still plenty of slack for RAG to pick up here.
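(The back-of-envelope behind those numbers, with the per-token price a pure assumption since pricing for the larger windows hadn't been published:)

```python
# Rough size/cost math for a single full-context query. The price per
# token is an assumed placeholder, not a published Gemini 1.5 rate.
TOKENS = 10_000_000
BYTES_PER_TOKEN = 4          # rough average for English text
USD_PER_1M_TOKENS = (5, 10)  # assumed range, in line with frontier-model pricing

print(f"~{TOKENS * BYTES_PER_TOKEN / 1e6:.0f} MB of raw text")  # ~40 MB
low, high = (TOKENS / 1e6 * p for p in USD_PER_1M_TOKENS)
print(f"~${low:.0f}-${high:.0f} per query")                     # ~$50-$100
```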
Honestly, I'm not sure what to make of this more broadly. The disruption to recalling details from small bodies of discovery documents, medium-sized codebases, and long videos is immediate. But for exploratory research, and for problems where the answer isn't buried neatly in some known blob of data, the effects are all higher-order. Similarly, it'll be very interesting to see how this impacts higher-level analysis. Will we see a precipitous drop in hallucinations? Or will the hallucinations just jump to higher analytical layers?