Eric Schmidt, former CEO of Google, appeared on MSNBC on May 7 to say that AI is “under-hyped”. Does the 2024 AI Index Annual Report support the idea that more is yet to come from AI? Schmidt said:
“The arrival of intelligence of a non-human form is a really big deal for the world … It’s coming, it’s here, it’s about to happen, it happens in stages. We used to say 20 years, now [we say] within 5. And the reason is the scaling laws for these systems are continuing to go up without any loss or degradation of power.”
We’ve seen that most of the 500-page report is at least consistent with past technology hype cycles, that AI technical benchmarks are struggling to provide clear measurements of what AI systems can now do, and that the evidence is strong that people get tangible productivity and profit benefits from adopting and using current AI in many domains.
Frankly, this isn’t enough evidence to say that AI is “under-hyped” or that bigger changes are coming in 5 years. What we need is evidence about the heart of Schmidt’s claim: the scaling laws.
Simplified, the term “scaling laws” in current AI refers to certain observed relationships between (A) an AI model’s size (in parameters, training data, training compute, and a few other measures) and (B) the performance of the model on certain tasks. They are loosely analogous to “Moore’s Law” in computing, the observation that the number of transistors on a chip doubles, and the cost of computation roughly halves, over predictable periods of time. These laws have been informing the design and training of AI models in recent years and provide some tentative insights into the underlying nature of AI systems. The scaling laws are, economically, why all the AI companies are throwing millions (and very soon billions) of dollars into training each AI system, and why those very expensive AI systems seem to perform so well.
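To make the shape of these relationships concrete, here is a minimal illustrative sketch (mine, not the Report’s, and with invented constants): scaling laws are typically written as power laws, in which a model’s loss falls smoothly and predictably as training compute grows.

```python
# Illustrative only: a power-law scaling curve of the kind the "scaling laws"
# describe, with invented constants. Real curves are fit to measured training runs.

def predicted_loss(compute_flop: float,
                   a: float = 15.0,         # hypothetical scale constant
                   b: float = 0.05,         # hypothetical power-law exponent
                   irreducible: float = 1.7) -> float:
    """Predicted training loss falls smoothly as a power law in training compute."""
    return a * compute_flop ** (-b) + irreducible

# Each 10x increase in compute buys a predictable, ever-smaller drop in loss.
for flop in (1e21, 1e22, 1e23, 1e24, 1e25):
    print(f"{flop:.0e} FLOP -> predicted loss {predicted_loss(flop):.2f}")
```

The point is not the specific numbers but the shape: each tenfold increase in compute buys a smaller, but predictable, improvement.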
The “scaling laws” are still more hypothesis than law. Even so, it’s disappointing that the Report contains no direct information on whether the scaling laws are continuing to hold.
The Report’s chapters on R&D and Technical Performance do provide mixed support for Schmidt’s claim with respect to (A) the size of models in the past year and (B) correlations in their performance. However, the evidence does not support any prediction of a transformative shift on a five-year timeframe, and, indeed, the evidence that performance increased in 2023 in line with the scaling laws is mixed.
(Note that new models have been released in 2024, after the Report’s analysis was completed, with what appear to be new capabilities: Anthropic’s Claude 3 Opus on 4 Mar 2024 and OpenAI’s GPT-4o on 13 May 2024. Evaluation of these new models, and of the heavily rumored Google Gemini 2 and OpenAI GPT-5 later in 2024, may yet prove the scaling laws are strong and healthy.)
Finally, another way to gauge the effects of the scaling laws, perhaps a grim one, is the increase in incidents covered in the Report’s chapter on ‘Responsible AI’. While the reported AI harms offer no strong evidence of new capabilities, this is another area to keep an eye on as AI models get exponentially larger.
(A) Size “exponentially larger”
Subsection 1.3 of the Report deals with frontier research, drawing on data from Epoch AI. It details the immense parameter counts, training data, training compute, and associated costs of leading-edge AI models in 2023.
The parameters in a machine learning model are the individual numerical values adjusted as the model learns from its training data. The parameters ultimately determine how a prompt is transformed into output.
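For a concrete sense of scale, here is a minimal sketch (my illustration, not the Report’s): a single dense layer mapping a 4,096-dimensional input to a 4,096-dimensional output already holds nearly 17 million parameters, and large models stack hundreds of such weight matrices.

```python
# Minimal sketch: counting the parameters in one dense (fully connected) layer.
# A transformer stacks many such weight matrices, which is how counts reach the billions.

def dense_layer_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Every entry of the weight matrix (and each bias term) is one learned parameter."""
    return d_in * d_out + (d_out if bias else 0)

print(f"{dense_layer_params(4096, 4096):,}")   # 16,781,312 parameters in a single layer
```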
Figure 1.3.5 in the Report shows the parameter counts of machine learning models since 2003 increasing exponentially. Before 2012, models had under 100 million parameters; in 2016 they reached almost 10 billion … by 2023, machine learning models had nearly 1 trillion parameters.
The Report explains the floating-point operation (FLOP), the unit used to measure the computation involved in training machine learning models, as “[A] single arithmetic operation involving floating-point numbers, such as addition, subtraction, multiplication, or division. The number of FLOPs a processor or computer can perform per second is an indicator of its computational power. The higher the FLOP rate, the more powerful the computer is.”
Figure 1.3.6 in the Report shows the training compute used for notable machine learning models since 2003, and the increase is exponential. Before 2012, models used under 100 petaFLOP of training compute; by 2014, models used 10,000 petaFLOP; 2015 models used 1,000,000 petaFLOP; 2016 models used 100,000,000 petaFLOP … by 2023, training a single model was estimated to take more than 100,000,000,000 petaFLOP.
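How do models reach such totals? A common back-of-the-envelope heuristic from the research community (not a figure from the Report) is that training takes roughly 6 FLOP per parameter per training token. A rough sketch, with hypothetical model sizes:

```python
# Rough rule of thumb (not a figure from the Report): total training compute is
# often approximated as ~6 FLOP per parameter per training token.

def training_flop(parameters: float, tokens: float) -> float:
    """Approximate total floating-point operations for one training run."""
    return 6 * parameters * tokens

# Hypothetical example: a 1-trillion-parameter model trained on 10 trillion tokens.
total = training_flop(1e12, 1e13)
print(f"{total:.1e} FLOP, i.e. about {total / 1e15:.1e} petaFLOP")
```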
Figure 1.3.22 in the Report shows the training costs of AI models rising exponentially since 2016. Before 2020, costs were under $1 million; 2022 models reached $10 million; 2023 models exceeded $100 million.
Note that the chart in Figure 1.3.22 ends with OpenAI’s GPT-4 and Google’s Gemini Ultra, which were estimated to cost around $78 million and $191 million, respectively.
These were not the most recent AI model releases as of May 2024. The Report contains no cost information on OpenAI’s newer GPT-4V or GPT-4o models, or on the third leading AI model, Anthropic’s Claude 3 Opus. Each of these unlisted models was released after the Report’s research was completed. However, we do have this quote from Anthropic’s co-founder, Dario Amodei, on 23 Aug 2023, echoing cost estimates from other companies and industry observers:
Interviewer: “Are we going to hit the limits of the scaling laws?”
Amodei: “Not anytime soon. Right now, the most expensive model costs around $100 million. Next year we will have $1+ billion models. By 2025, we may have a $10 billion model.”
(B) But Are We Getting Performance?
Chapter 2 of the Report deals with technical performance and reviews a series of benchmarks demonstrating the rapid progression of GPT-4 and Gemini Ultra on complex and nuanced tasks.
In language understanding, the Holistic Evaluation of Language Models (HELM), introduced by Stanford researchers in 2022, was designed to evaluate LLMs across diverse scenarios, including reading comprehension, language understanding, and mathematical reasoning. By January 2024, GPT-4 emerged as a leader on the HELM leaderboard, reflecting its superior performance across multiple scenarios.
The Massive Multitask Language Understanding (MMLU) benchmark further challenges AI systems by assessing their performance across 57 subjects in zero-shot or few-shot scenarios, ranging from the humanities to STEM and social sciences. Google’s Gemini Ultra achieved the highest scores, handling a broad range of complex topics.
In the area of truthfulness, the TruthfulQA benchmark evaluates the truthfulness of LLMs in generating answers to questions designed to challenge commonly held misconceptions. Here, GPT-4 achieved scores nearly three times higher than earlier models.
Reasoning capabilities of AI have also seen considerable advancements, prompting the need for more demanding benchmarks like the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (MMMU). This benchmark involves a wide array of subjects and complex question formats, from charts to chemical structures. As of early 2024, Gemini Ultra bested the other models in comprehensive reasoning capabilities across all tested disciplines.
Causal reasoning is assessed through the BigToM benchmark, which evaluates theory-of-mind (ToM) capabilities in LLMs, crucial for understanding and attributing mental states such as beliefs and intentions. In this benchmark, GPT-4 has shown near-human performance in tasks requiring the prediction of future events and retroactive inference of causes.
However, despite all the investment, the actual performance gains on newer, more challenging benchmarks suggest a nuanced picture. The models are indeed improving, but the rate of improvement and the application of these advances in different settings remain complex, and 2023 performance did not line up fully with what the scaling laws predicted, for several reasons.
First, there are ways to increase performance other than the emergent gains from larger and larger models that the scaling laws predict. Techniques for improving these models include advanced prompting, fine-tuning, and attention mechanisms. While prompting doesn’t directly apply to all benchmarks, the Report describes two new examples of prompting techniques (a brief sketch of the idea follows the quoted passages below):
Chain of thought (CoT) and Tree of Thoughts (ToT) are prompting methods that can improve the performance of LLMs on reasoning tasks. In 2023, European researchers introduced another prompting method, Graph of Thoughts (GoT), that has also shown promise (Figure 2.12.1). GoT enables LLMs to model their thoughts in a more flexible, graph-like structure which more closely mirrors actual human reasoning.
A paper from DeepMind has introduced Optimization by PROmpting (OPRO), a method that uses LLMs to iteratively generate prompts to improve algorithmic performance. OPRO uses natural language to guide LLMs in creating new prompts based on problem descriptions and previous solutions… Compared to other prompting approaches like “let’s think step by step” or an empty starting point, OPRO leads to significantly greater accuracy on virtually all 23 BIG-bench Hard tasks.
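To make the underlying idea concrete, here is a minimal, hypothetical sketch of the difference between a plain prompt and a chain-of-thought prompt. It is not code from the Report or the cited papers; `call_llm` is a placeholder for whatever model API is actually used.

```python
# Hypothetical sketch of chain-of-thought (CoT) prompting; call_llm is a placeholder
# for a real model API call, not an actual library function.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Stand-in for a real hosted-LLM API call.")

question = ("A bat and a ball cost $1.10 together. The bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

# Plain prompt: the model is asked to answer directly.
plain_prompt = f"Q: {question}\nA:"

# Chain-of-thought prompt: the model is nudged to reason step by step first,
# which the Report notes can improve performance on reasoning tasks.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

print(cot_prompt)
# answer = call_llm(cot_prompt)  # would return the model's reasoning and final answer
```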
Second, additional studies highlight the variability in AI performance over time, particularly in publicly available large language models (LLMs) like GPT-3.5 and GPT-4. Research conducted by teams at Stanford and Berkeley observed significant performance declines in these models over a short period. For example, between March and June 2023, GPT-4's ability to generate code, answer sensitive questions, and solve mathematical problems worsened significantly. This decline in performance is attributed to the model's decreasing ability to follow instructions, indicating that LLM capabilities can fluctuate and regress.
Moreover, a study by researchers from DeepMind and the University of Illinois at Urbana–Champaign tested GPT-4's reasoning abilities across several benchmarks. They discovered that the model's performance declined when it was required to self-correct without guidance. This body of research collectively underscores the complexities and challenges in evaluating AI systems over time.
(B) … Dangers?
If we can’t necessarily determine whether scaling laws are causing increased performance in machine learning models, we might instead look to their effects.
Chapter 3 of the Report makes use of the AI Incident Database (AIID), which documents instances of AI harms and has found significant annual increases in such incidents. In 2023, the AIID had 123 reported incidents, marking a 32.3% increase from the previous year and representing a more than twentyfold increase since 2013. This rise may be attributed to the broader integration of AI and improved incident tracking, which could suggest that earlier incidents may have been underreported.
Nevertheless, it’s new types of incidents that would underscore new capabilities of AI systems.
In January 2024, AI-generated nude images of Taylor Swift appeared on social media, attracting over 45 million views before their removal. The Report explains:
Generative AI models can effortlessly extrapolate from training data, which often include nude images and celebrity photographs, to produce nude images of celebrities, even when images of the targeted celebrity are absent from the original dataset. There are filters put in place that aim to prevent such content creation; however, these filters can usually be circumvented with relative ease.
The Report also describes unsafe behavior in autonomous vehicles using machine learning controls, such as a Tesla in Full Self-Driving mode failing to stop for a pedestrian, as a part of a broader pattern of risky behavior by autonomous vehicles.
Privacy and persuasion concerns are also showing up with romantic AI chatbots designed to mimic human relationships. A review by the Mozilla Foundation found that these chatbots tend to elicit and collect extensive personal and sensitive information and often lack adequate data protection.
In sum, however, these incidents aren’t unique expressions of new capabilities. While this area deserves continued attention, there is no clear evidence in the Report of new emergent capabilities from AI in 2023.
Some Thoughts
Back in 2018, OpenAI posted on its website that “it’s worth preparing for the implications of systems far outside today’s [2018] capabilities” and that “Past trends are not sufficient to predict how long the [AI exponential] trend will continue into the future, or what will happen while it continues.”
That prediction, based on the scaling laws, has remained prescient for five years. Even if 2023 didn’t produce incontrovertible evidence that performance has increased because of larger and larger models, the releases of GPT-4V, GPT-4o, and Anthropic’s Claude 3 Opus since the Report was published have provided more evidence that the scaling trends are continuing, at least to some degree. Especially if GPT-5 and Gemini 2 are released this year, next year’s Report may be much more interesting.
It’s worth adding to this post that senior members of OpenAI’s safety team quit the company in protest last week. They were explicitly distrustful of OpenAI’s corporate priorities and quit out of frustration with the company’s lack of resources for, and attention to, its safety work.
It is very difficult to assess the trajectory of AI capabilities. However, the idealistic 2018 OpenAI post continued, “even the reasonable potential for rapid increases in capabilities means it is critical to start addressing both safety and malicious use of AI today.”