The 2024 AI Index Annual Report, spanning 500 pages, details AI trends across technical, public, governance, geopolitical, scientific, economic, commercial, educational, and ethical domains. With most metrics soaring and the U.S. AI industry maintaining its dominance, the Report prompts a question: are we experiencing peak Silicon Valley hype, or are these tools genuinely benefiting people?
In contrast with the bulk of the Report’s potential hype, examined in Part 1, consistent human adoption and the benefits actually reported from such use provide a more tangible measure of AI's utility. This article, Part 2, draws from the Report’s discussions of Technical Performance, Public Opinion, and the Economy to explore these aspects. (The next part will look at the parts of the Report evidencing increases in AI capabilities independent of human adoption or enhancement.)
We can measure actual AI usefulness in two ways: technical benchmarking against human performance and rates of human adoption. In other words, we can assess how AI measures up by what it can do and whether it is actually being persistently adopted.
AI technical benchmark tests are struggling to keep up with AI outputs. The results are increasingly hard to interpret: the tests are saturated as AI systems are taught to beat them, so the systems appear to reach superhuman levels on a benchmark, peak, and then stagnate. Persistent human adoption, on the other hand, is a clear indication that AI is useful, as are reports of the tangible benefits of such adoption.
Surpassing Human Performance
The Report describes AI progress against human benchmarks like this:
As of 2023, AI has achieved levels of performance that surpass human capabilities across a range of tasks. Figure 2.1.16 illustrates the progress of AI systems relative to human baselines for nine AI benchmarks corresponding to nine tasks (e.g., image classification or basic-level reading comprehension). The AI Index team selected one benchmark to represent each task.
Over the years, AI has surpassed human baselines on a handful of benchmarks, such as image classification in 2015, basic reading comprehension in 2017, visual reasoning in 2020, and natural language inference in 2021. As of 2023, there are still some task categories where AI fails to exceed human ability. These tend to be more complex cognitive tasks, such as visual commonsense reasoning and advanced-level mathematical problem-solving (competition-level math problems).
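One simple way to read a chart like Figure 2.1.16: normalize each AI score against the corresponding human baseline, so that 100 marks human-level performance. Below is a minimal sketch with hypothetical numbers, not the Index team's exact methodology:

```python
def relative_to_human(ai_score: float, human_baseline: float) -> float:
    """Express an AI benchmark score as a percentage of the human baseline.

    Values above 100.0 mean the AI system exceeds the human baseline,
    the crossing point that charts like Figure 2.1.16 highlight.
    """
    return 100.0 * ai_score / human_baseline

# Hypothetical scores: accuracy on an image classification benchmark.
print(relative_to_human(ai_score=97.0, human_baseline=94.9))  # ~102.2
```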
The Report’s benchmarks go into significantly more detail in the many subchapters of Chapter 2: (2.2) language, evaluating AI's understanding and generation of language, its ability to comprehend complex constructs and produce coherent responses, and the factuality and truthfulness of its outputs; (2.3) coding, scrutinizing AI's ability to generate functional programming code; (2.4) images, exploring AI's proficiency in generating and processing images, following complex visual instructions, and performing editing tasks; (2.5) video, assessing AI's capabilities in video generation and understanding; (2.6) reasoning, covering AI's performance in general, mathematical, visual, moral, and causal reasoning tasks; (2.7) audio, examining AI’s ability to create and manipulate sound; (2.8) agents, evaluating how well AI systems perform autonomously on both general and task-specific problems; and (2.9) robotics, testing AI's integration into robotic systems.
Collectively, the tests indicate both the progress made and the areas needing further work, offering an overview of AI's current capabilities and potential future directions.
However, the results of these tests are also increasingly hard to interpret. The tests are saturated by AI systems being taught to beat them. The resulting AI systems appear to reach impressive levels, peak, and then stagnate, even as they surpass human baselines in nearly every domain we throw at them. It’s a paradoxical impression, and not at all satisfying to the underlying question: what can AI do, and what is beyond its abilities? For now, all we know is that capabilities are advancing; how far remains just out of reach.
Competing for Human Preference
Human preference evaluation is also considered in Chapter 2. With generative models producing text, images, and more, subchapter 2.2 describes how benchmarking is starting to include human preference evaluations such as the Chatbot Arena Leaderboard, instead of abstract test rankings. The Chatbot Arena Leaderboard allows users to prompt two anonymous AI systems and vote for the preferred generative response. The Report states, “As of early 2024, the platform has garnered over 200,000 votes, and users ranked OpenAI’s GPT-4 Turbo as the most preferred model.” Where testing can produce ambiguous results, public sentiment about the outputs of different AI systems cannot, and it is becoming important in tracking AI progress.
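Leaderboards like Chatbot Arena aggregate those pairwise votes into Elo-style ratings, the rating scheme familiar from chess. Here is a minimal sketch of that aggregation, with hypothetical model names and votes; the leaderboard's production pipeline is more elaborate:

```python
from collections import defaultdict

def elo_ratings(battles, k=4.0, base_rating=1000.0, scale=400.0):
    """Fold pairwise preference votes into Elo-style ratings.

    battles: iterable of (model_a, model_b, winner) tuples,
    where winner is "a", "b", or "tie".
    """
    ratings = defaultdict(lambda: base_rating)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / scale))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Hypothetical votes: each tuple records one user's preference.
votes = [
    ("gpt-4-turbo", "model-x", "a"),
    ("model-x", "model-y", "tie"),
    ("gpt-4-turbo", "model-y", "a"),
]
print(sorted(elo_ratings(votes).items(), key=lambda kv: -kv[1]))
```

With enough votes, a model's rating converges to reflect how often users prefer it over its opponents, which is why a single leaderboard number can summarize hundreds of thousands of pairwise judgments.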
In Chapter 9, on public opinion, the data shows public recognition and usage of ChatGPT. An international survey from the University of Toronto suggests that 63% of respondents are aware of ChatGPT. Of those aware, about half report using ChatGPT at least once weekly, i.e., roughly 31.5% of all respondents (0.63 × 0.5 ≈ 0.315).
Other findings suggest that LLMs are being broadly used in scientific writing (though such use is especially hard to measure). In a study outside the Report, Andrew Gray of University College London analyzed scientific papers for words disproportionately used by chatbots, such as “intricate,” “meticulous,” and “commendable.” His findings, pending peer review, suggest that at least 60,000 scientific papers published last year may have utilized a large language model (LLM), slightly more than 1% of all scientific publications globally for the year. Another study found even higher usage rates in specific scientific fields: up to 17.5% of computer science papers published between January 2020 and February 2024 exhibited signs of AI assistance in writing, while mathematics papers showed the least LLM modification at 6.3%.
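A toy sketch of this kind of marker-word screening follows, with a hypothetical word list and abstract; Gray's actual analysis compared word frequencies across years of publications against pre-chatbot baselines rather than scoring individual documents:

```python
import re
from collections import Counter

# Hypothetical marker list; the study derived its candidates from words
# whose usage jumped after chatbots became widely available.
MARKER_WORDS = {"intricate", "meticulous", "meticulously", "commendable"}

def marker_word_rate(text: str) -> float:
    """Return the fraction of tokens that are suspected chatbot marker words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[w] for w in MARKER_WORDS) / len(tokens)

abstract = ("We present a meticulous analysis of this intricate problem "
            "and a commendable improvement over prior work.")
print(f"{marker_word_rate(abstract):.3f}")  # 0.188 for this toy abstract
```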
Chapter 4 of the Report, focusing on economic impacts, confirms the integration of AI into business operations and, drawing on recent surveys and studies, documents significant benefits in profit and productivity.
The Report incorporates data from McKinsey’s “The State of AI in 2023: Generative AI’s Breakout Year,” which claims the adoption of AI within organizational structures is on the rise: 55% of organizations now utilize AI in at least one business unit or function, a significant increase from 50% in 2022 and just 20% in 2017. The financial services industry led adoption. The most commonly adopted AI use cases were contact-center automation (26%); tailoring products, services, content, recommendations, and marketing (23%); customer acquisition (22%); and AI-based enhancements of products (22%). Further, 42% of organizations experienced cost reductions from implementing AI technologies, including generative AI, and 59% of the entities surveyed observed revenue increases. The survey also notes a 10% increase from the previous year in organizations reporting decreased costs, indicating that AI is actually contributing to business efficiency.
Moreover, many studies conducted in 2023 positively assessed AI's impact on the workforce:
[A Stack Overflow Developer Survey] asked about the primary advantages of AI tools in professional development, [and] developers responded with increased productivity (32.8%), accelerated learning (25.2%), and enhanced efficiency (25.0%)… A significant majority of developers hold a positive view of AI tools, with 27.7% feeling very favorably and 48.4% favorably inclined toward them. Only 3.2% express unfavorable opinions about AI development tools.
A Harvard Business School study revealed that consultants with access to GPT-4 increased their productivity on a selection of consulting tasks by 12.2%, speed by 25.1%, and quality by 40.0%, compared to a control group without AI access.
Research from the National Bureau of Economic Research reported that call-center agents using AI handled 14.2% more calls per hour than those not using AI (Figure 4.4.21).
A study on the impact of AI in legal analysis showed that teams with GPT-4 access significantly improved in efficiency and achieved notable quality improvements in various legal tasks, especially contract drafting.
These studies suggest that AI actually enables workers to complete their tasks more efficiently and with higher-quality output.
All of these examples, from various domains of the Report, are good evidence of AI's genuine upward trajectory as a transformative tool. This groundwork sets the stage for the final installment of the series, which will shift focus from measuring AI as a useful tool to human-independent AI capabilities.