Dear Aventine Readers,
If you've been paying even half-hearted attention to AI news over the past few months, you're likely aware of the disconnect between company announcements of startling new AI capabilities and a noticeable yawn from users when they try out the latest models. The word "plateau" has been used to describe the current state of AI. In this issue we dig into what's causing this discrepancy and find that much of it has to do with how progress is defined and assessed. AI promises to be many things to many people, and those many capabilities are hard to quantify. At the same time, certain AI capabilities are advancing at remarkable speed.
Also in this issue:
Thanks for reading!
Danielle Mattoon
Is AI Getting Better or Not?
Something doesn’t feel quite right about progress in artificial intelligence.
According to certain measurements, AI capabilities are advancing rapidly. Yet users, investors and even AI researchers increasingly report that progress is stalled. "We've entered a new phase where progress in chatbots is starting to top out," tweeted Stephen McAleer, an AI researcher at Anthropic, in August.
This is more than just an annoyance for impatient consumers. Hundreds of billions of dollars are being spent on AI, and if those who use it aren't noticing the progress they've come to expect, that's a problem. So, is AI getting better or not?
To answer this question Aventine spoke with researchers who study the pace of AI development. "Hard and painful" is how David Cox, VP for AI models at IBM Research, described the effort to measure AI progress. The challenge comes from the seemingly infinite approaches to measuring different aspects of the technology and the lack of clarity over which assessment is the most meaningful. Better math scores? More persuasive writing? Automation of tasks that save companies money? Making someone's job easier? Each definition demands different metrics, and there's no broadly agreed-upon gold standard for measuring progress.
It’s like the parable of blind men examining an elephant, said Marc Warner, CEO of Faculty, a London-based firm that advises organizations on AI deployment. Each man touches a different part and concludes they're feeling a different animal. “We find it extremely hard to have the perfect measure of how [AI models] are progressing,” he said.
Experts who spoke with Aventine agreed that for now it was important to measure progress using a variety of approaches. And by and large, two broad categories of tests — called evaluations or evals for short — have emerged: exam-style assessments with objectively right or wrong answers, and messier real-world assessments that attempt to capture practical utility.
Exams for AI
A large and growing battery of standardized tests measures AI competency across a wide range of domains. Popular benchmarks include GPQA (PhD-level science questions), FrontierMath (math problems that take experts hours to solve), SWE-bench (software engineering challenges) and DeepResearch Bench (high-grade research and analysis problems). And those are just four of hundreds, perhaps thousands, of such tests.
On all these benchmarks, more advanced AI models typically achieve higher scores than their predecessors. But these evaluations also have significant limitations. Models eventually "saturate" benchmarks by achieving near-perfect scores, requiring ever harder tests and making it difficult to compare results over time. There's also variation in how tests are administered and the results they produce, said Cox, and little agreement on how to synthesize hundreds of metrics into a holistic picture. Test questions can also leak into training data sets — essentially revealing the answers and rendering the tests meaningless. In other cases, tests can become a target that companies specifically train their models to perform well on, which means models may underperform on tasks unrelated to material in the tests.
Finally, the evaluations overwhelmingly focus on math, science and software development. That's partly because problems in these areas have clear-cut right answers, and "well-defined tasks with actual right and wrong answers are the obvious place to start," said Warner. But several experts also pointed out that this sort of intelligence represents facets of human endeavor that certain AI researchers tend to value most. This narrow focus may hide progress — or lack thereof — in other domains.
Taking on longer tasks
The most compelling evidence for continued progress in AI comes from METR, a nonprofit research center (the name is short for Model Evaluation and Threat Research) that evaluates the capabilities of AI. One of its assessment methods measures something other tests don't: the ability of AI to accomplish tasks that take human software developers differing amounts of time to complete. (Task length here can be thought of as a proxy for the complexity of a task, though METR doesn't itself make that assertion.) The assessment works as follows: Software developers complete a series of complicated tasks, and then an AI model is given the same tasks. A model's score is the length of the longest tasks, measured in human completion time, that it can finish at least 50 percent of the time.
Over time, results of the METR tests have shown a striking trend: The duration of the tasks AI can complete 50 percent of the time is doubling every four to seven months, far faster than the two-year doubling of the number of transistors that fit on a microchip as described by Moore's Law. In 2023, GPT-4 was completing tasks that took humans 5 minutes. At the end of 2024, OpenAI's first reasoning model, o1, succeeded at tasks that took humans 39 minutes. Today, GPT-5 can complete tasks requiring 2 hours 17 minutes of human work. If the trend continues at this pace, by 2026 AIs will be able to perform tasks that take software developers a full day to complete, and by sometime in 2028 AIs will be completing tasks that take humans a full working week.
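For readers who want to check the arithmetic, here is a minimal sketch of that extrapolation in Python. The 2-hour-17-minute GPT-5 horizon and the four-to-seven-month doubling range come from the METR trend described above; the reference date and the perfectly clean exponential growth are simplifying assumptions for illustration, not METR's own model.

```python
# Minimal sketch of the METR-style doubling extrapolation described above.
# The 137-minute GPT-5 horizon and the 4-to-7-month doubling range come
# from the article; the reference date and the perfectly clean exponential
# growth are simplifying assumptions.
from datetime import date

def projected_horizon_minutes(current_minutes: float, start: date,
                              target: date, doubling_months: float) -> float:
    """Project a 50-percent task-time horizon forward, assuming it
    doubles every doubling_months months."""
    months_elapsed = (target - start).days / 30.44  # mean month length
    return current_minutes * 2 ** (months_elapsed / doubling_months)

GPT5_HORIZON = 137         # 2 hours 17 minutes, expressed in minutes
TODAY = date(2025, 11, 1)  # assumed reference date

for doubling in (4, 7):    # the reported doubling range, in months
    end_2026 = projected_horizon_minutes(GPT5_HORIZON, TODAY,
                                         date(2026, 12, 31), doubling)
    mid_2028 = projected_horizon_minutes(GPT5_HORIZON, TODAY,
                                         date(2028, 7, 1), doubling)
    print(f"doubling every {doubling} months: ~{end_2026 / 60:.0f} hours "
          f"by end of 2026, ~{mid_2028 / (60 * 40):.1f} 40-hour weeks "
          f"by mid-2028")
```

Even at the slower seven-month doubling rate, the projected horizon passes a full working day during 2026 and a full 40-hour week in 2028, which is where the projections above come from.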
This metric has limitations. It still focuses narrowly on software development, and task duration isn't a direct analog for complexity: Even if the approach were extended beyond software development, a two-hour bureaucratic task isn't necessarily more difficult than a 30-minute math problem. Yet its open-ended nature allows consistent comparison across models, and the exponential trend provides a means of predicting what might come next. Several experts suggested the METR measurement may — like Moore's Law — capture something fundamental about AI progress, even if we don't fully understand what that is yet.
Still, "we shouldn't necessarily assume it continues forever," Warner cautioned.
The perception problem
Given these technical gains, why aren’t users consistently experiencing a corresponding improvement? Warner suggests that one reason may be that humans struggle to assess the quality of work produced by AI if they lack domain expertise. As models approach competence on everyday tasks, improvements become less perceptible.
Another factor: priorities. Many researchers in AI labs are now focused on building models capable of doing AI research that could accelerate R&D dramatically. "For the hard sciences, where you can verify the answer, you can optimize these models much faster," said José Hernández-Orallo, a professor at the Valencian Research Institute for Artificial Intelligence. "And that means we see less progress in some other areas."
Yet Cox suggests that consumer experience might be a distraction. "People over-index on the consumer experience," he said. "A chat window on a web page [is] not the right way to index how that model is going to behave when you're actually doing an enterprise task."
Ultimately, the gap between an AI model that can ace a math test and consumers who are underwhelmed by chatbots highlights an uncomfortable truth: So far, models that perform well on tests are not consistently delivering real-world impact.
An economic reality check
AI's success will depend on whether its progress delivers value: whether it makes work cheaper and easier. "It's kind of critical whether AI systems actually give productivity an uplift," said Greg Burnham, a senior researcher at Epoch AI, a nonprofit research institute that tracks AI trends.
The economic evidence so far is sobering. The most frequently cited study — MIT's Project NANDA report (published in July) — claimed that 95 percent of generative AI projects fail to provide a return on investment. (Critics of the study attributed this to poor implementation rather than AI limitations.) Additionally, METR ran a randomized controlled trial with experienced software developers and found they took 19 percent longer to complete tasks when using AI tools. Whether this reflects genuine AI limitations or methodological issues remains unclear, especially given that many companies have claimed AI helps developers write and deploy code faster.
OpenAI has developed its own economic benchmarks. In February, it released SWE-lancer, in which models attempt to "earn" money by completing tasks taken from the freelance marketplace Upwork. (They tackle the tasks, but as a test — this isn't real work for actual employers.) So far, models are able to earn up to about 40 percent of the virtual fees on offer. Then in September the company released GDPval, designed to measure performance on economically valuable tasks across 44 occupations. To date, the most advanced models approach, but have not yet surpassed, human-level ability, with Anthropic's Claude Opus 4.1 model scoring highest. OpenAI concedes that GDPval is still a relatively rudimentary means of assessing the real-world contributions AI can make, as it "doesn't reflect the full nuance of many economic tasks."
Industry-specific benchmarks — like IBM's AssetOpsBench for IT tasks and Salesforce's CRM benchmark — attempt more targeted assessments. But concrete evidence that AI is transforming productivity is elusive. Substantial real-world impact would likely translate into economic statistics, such as spikes in unemployment, that have yet to materialize.
The scaling question
So far, continued improvement as measured by evaluations and by METR's benchmark has come from scaling: larger models, more computation, bigger training datasets. Yet as OpenAI CEO Sam Altman has acknowledged, scaling delivers diminishing returns: Each improvement in capability requires roughly ten times the resources of the previous one, meaning resources must increase exponentially to maintain linear progress.
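A toy model makes that dynamic plain. The only input is the rough ten-times figure quoted above; the units and step counts below are arbitrary illustrations, not any lab's actual numbers.

```python
import math

# Toy model of diminishing returns from scaling, using the rough
# "ten times the resources per capability step" figure quoted above.
# Units are arbitrary; this is an illustration, not real lab data.
STEP_COST_FACTOR = 10

for step in range(1, 6):
    compute_needed = STEP_COST_FACTOR ** step
    print(f"capability step {step}: {compute_needed:>8,}x baseline compute")

# Equivalently, capability grows roughly with the logarithm of compute:
for compute in (10, 1_000, 100_000):
    print(f"{compute:>8,}x compute buys ~{math.log10(compute):.0f} steps")
```

Five equal steps of improvement already demand a hundred-thousand-fold increase in resources; under this toy model, even the 10,000-fold scale-up that Epoch AI projects by 2030 (see below) buys only about four more such steps.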
Whether scaling will continue delivering gains isn't guaranteed, and views in the field vary from “absolutely yes” to “absolutely not.” What would definitely halt progress: insufficient power, chips, or data to keep up the pace of scaling. With adequate funding, according to Epoch AI, scaling should be able to continue through 2030, enabling training runs 10,000 times larger than the estimated scale of GPT-4's. If that happens, insufficient power would be the first constraint to potentially slow progress.
That "adequate funding" caveat is significant: Previously, companies secured capital to train new models before current ones launched, Burnham noted. As costs grow, investors may demand proven results before reinvesting, potentially lengthening development timelines.
What comes next
So is progress slowing? According to experts Aventine consulted: not yet. They said that a slowdown would mean evaluation scores plateauing and METR task horizons flattening, neither of which has happened. The real nail in the coffin, several said, would be if no clear ROI emerged over the next year or two.
"Progress is continuing more or less on trend," said Daniel Kokotajlo, lead author of AI 2027, a project by the nonprofit AI Futures Project that mapped potential AI scenarios, in an email. In his estimation, the perception among users of slowing progress from GPT-5's launch, which made New Yorker headlines in August, "wasn't as big a deal as many seem to think."
The important question to ask might not be whether AI progress is slowing, but instead whether the kind of progress we're seeing will bring the transformative impact that justifies hundreds of billions in investment. So far, we haven’t observed that, or worked out how best to measure it. Whether the gap between investment and return closes or widens will determine if we're witnessing a boom or a bubble.
Listen To Our Podcast
Learn about the past, present and future of artificial intelligence on our latest podcast, Humans vs Machines with Gary Marcus.
Advances That Matter
Facility where 86 million gallons of radioactive and chemical materials from the world’s first plutonium production reactor in Benton County, Washington, will be entombed. Department of Energy
America's historic nuclear weapons factory is finally being cleaned up. The world's first plutonium production reactor at the Hanford site in Benton County, Washington, was the crucible of America's 20th century nuclear arsenal and a pivotal part of the Manhattan Project and the Cold War. Now, nearly 20 years later than promised, Science reports, a facility there has begun entombing some of the site's 86 million gallons of radioactive and chemical waste inside glass. Approximately 90 percent of the waste — currently held inside 177 underground tanks, some of which have corroded and leaked — is liquid, and will be mixed with molten silicate and other additives to create inert glass cylinders, a process known as vitrification. While these cylinders remain radioactive, they can be stored safely, without fear of leaks. The remainder of the waste, a thick sludge at the bottom of the tanks, will be processed in a second facility that is still being built after its construction stalled for over a decade. Getting to this stage has been expensive: More than $10 billion has been spent on the cleanup effort so far, and the full project, including the second plant, is expected to cost more than $200 billion and run into the 2070s. The Hanford site is also going to be put to other uses over that time. Earlier this year, The New York Times reported that, as part of an independent project, a 10,300-acre section of the site is being redeveloped as a massive solar farm that by 2030 could generate as much as 2,000 megawatts of electricity, or enough to supply all the homes in Seattle, San Francisco and Denver combined.
A new eye implant helps people with sight loss to read again. In a clinical trial published in the New England Journal of Medicine, a wireless retinal implant called PRIMA restored some central vision in people blinded by geographic atrophy, an advanced form of dry age-related macular degeneration that affects 1 million people in the US. The 2-millimeter-square, solar-powered chip is implanted under the retina and works with glasses that capture a scene and send it to the implant. The chip converts the signal into tiny electrical pulses that the retina relays to the brain. The system operates in the infrared spectrum, a kind of light that humans can't normally see, so as not to interfere with any existing vision. Participants in the study all had vision in one eye worse than 20/320 — meaning they could see at 20 feet what a person with normal vision can see from 320 feet. Of the 32 people who received the implant and were followed up 12 months later, 27 were able to read again, with an average improvement of approximately five extra lines on an eye chart. The vision provided by the system is not comparable to what most people reading this likely have: It's black and white, blurry and covers only a small area, and people need training to use it well. The researchers plan to improve the system so that people can see in gray scale instead, according to New Scientist, as well as to increase the resolution of the chip to further improve the clarity of sight. The device is being developed by Science Corporation, which has applied for a European license to sell it and says it's in discussion with US regulators.
Gene editing is finally heading for mitochondria. CRISPR gene editing has transformed modern biology, but one pocket of genetic code has stayed off-limits: the DNA inside mitochondria. These energy-producing organelles inside cells are wrapped in tough membranes that block the RNA molecules that enable CRISPR to make its edits. Yet mitochondria are highly prone to mutations, which give rise to more than 300 incurable genetic disorders affecting roughly 1 in 5,000 people globally. While techniques are being developed to help parents avoid passing mitochondrial mutations on to their children, gene editing would also allow cases of mitochondrial disease that develop later in life to be treated. Instead of using standard CRISPR, researchers have developed a technique that uses repurposed toxins, which usually damage DNA, to change single DNA letters inside mitochondria, Nature reports. They've also figured out how to get those toxins inside the mitochondria aboard proteins, rather than the bigger RNA molecules used by CRISPR. In mice, scientists have shown they can install disease-causing mutations to model illness and, in some cases, subsequently correct those mutations after birth using the same editing approach. That alone will help researchers understand mitochondrial disease better and test treatments more easily. But if the technique can be shown to be safe and accurate enough, gene editing treatments for mitochondrial diseases could one day be used in humans too.