Dear Aventine Readers,
Time for another newsletter about AI!
It’s been just about two years since the debut of ChatGPT, the first large language model widely available to the public and — for better and for worse — the default face of modern artificial intelligence. Since then, updated ChatGPT models have produced language that is more fluid and answers that are more accurate than those of earlier versions. But there remains a widely acknowledged and glaring weakness in these models: They can’t reason very well. They sort through many billions of words and sentences to produce plausible arrangements of text in response to questions. That is not the same as building on a base of knowledge to reach answers to difficult questions through a series of logical and justifiable steps. But this could be changing. Today we’ll look at some of the latest breakthroughs in AI’s ability to reason and what that means for machines going forward.
Also in this issue: A new treatment successfully reversed Type 1 diabetes in a single patient and is being tried on others; small modular nuclear reactors will soon be part of the U.S. energy mix, thanks to the growing energy needs of Big Tech; and there’s growing evidence that what we eat affects how we respond to illness.
As usual, thanks for reading!
Danielle Mattoon
Executive Director, Aventine
AI Can Reason, Except When It Can’t. Where Do We Go From Here?
Let’s start with a question: If a sperm from species A is injected into an egg from species B and both species have the same number of chromosomes, what would be the main cause of the resulting zygote mortality?
Don’t worry! Most people can’t answer that. It’s a question from an artificial intelligence benchmarking test called the Graduate-Level Google-Proof Q&A Benchmark, written by experts in biology, physics and chemistry. Even people with or pursuing PhDs in the relevant fields only score, on average, 65 percent. And the most advanced version of OpenAI’s GPT-4 model only scored 53 percent. But a new contender has entered the arena and beaten them all. In September, OpenAI reported that its new o1 model scored 78 percent on the test. And it did this, according to the company, by performing “complex reasoning,” meaning that it moved from knowledge of certain underlying facts toward a final answer without accessing any external sources for more information. The hitch is, when confronted with other benchmarking tests from outside sources, o1 floundered. One test, called the Abstraction and Reasoning Corpus, requires test subjects to complete a visual puzzle that involves identifying a pattern in a small number of examples before using that pattern to answer a final question. You can try it here. Average human performance on that test is between 73.3 percent and 77.2 percent. OpenAI’s o1 model? A mere 21 percent.
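To get a feel for what the Abstraction and Reasoning Corpus asks of a test-taker, here is a toy, ARC-style task sketched in Python. It is an illustration of the test’s structure only, not an actual ARC puzzle; the grids, the hidden rule and the function name are invented for the example. A handful of input/output grid pairs demonstrate a rule, and the solver must infer that rule and apply it to a new input.

```python
# Toy illustration of an ARC-style task (not an actual ARC puzzle).
# Each grid is a list of rows of small integers representing colors.
# The hidden rule in this toy task: flip the grid left-to-right.

train_pairs = [
    ([[1, 0, 0],
      [0, 2, 0]],   [[0, 0, 1],
                     [0, 2, 0]]),
    ([[3, 3, 0],
      [0, 0, 4]],   [[0, 3, 3],
                     [4, 0, 0]]),
]

test_input = [[5, 0, 0],
              [0, 0, 6]]

def apply_rule(grid):
    """Apply the (known) toy rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A test-taker, human or AI, sees only train_pairs and test_input;
# it must infer the rule from the examples, then produce the answer.
assert all(apply_rule(inp) == out for inp, out in train_pairs)
print(apply_rule(test_input))   # [[0, 0, 5], [6, 0, 0]]
```

People solve most puzzles of this kind from just a few examples; as the scores above show, o1 does not.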
“[Current AI models] are able to reason, at least in the sense that they are able to combine different sources of knowledge … to reach some conclusions that are not obvious,” said José Hernández-Orallo, a professor at the Valencian Research Institute for Artificial Intelligence. “[But while] they are very capable of solving very complex problems … they can fail at very easy things.”
This disparity in the scores speaks to essential questions about artificial intelligence: What does it mean for AI to reason? Why do we want it to? How should we measure success? And what’s required to endow artificial intelligence with across-the-board reasoning skills?
Aventine spoke with Hernández-Orallo and other experts in artificial intelligence and cognitive science to understand how far artificial intelligence has come in its ability to reason. They were excited about recent advances and optimistic about what might come next. But they also described complexities around measuring AI’s abilities, and many of them argued that greater diversification in the underlying technologies of AI will be required for it to reliably perform tasks that require complex reasoning.
The quest to build a reasoning machine
Ask a roomful of psychologists to define what reasoning means and, if the interviews for this piece are anything to go by, you’ll receive a roomful of subtly different answers. By and large, the view of the experts who spoke with Aventine can be boiled down to something like this: Reasoning is the process of starting with some given information or observations, then deriving new and correct information from that starting point. It is a fundamental component of human intelligence and allows us to think beyond the confines of what we directly observe or have committed to memory.
Endowing artificial intelligence systems with this ability — which should not be confused with other cognitive traits such as self-awareness — is believed to be key to making them more consistently accurate and reliable. “We can't give [AIs] explicit knowledge or information about everything in our world, right? [There’s] too much, and we don't even know it ourselves,” said Melanie Mitchell, a professor specializing in complex systems and AI at the Santa Fe Institute. “So they have to be able to do some kind of generalization, and generalization involves some kind of reasoning.” To think about it another way: If we want AI to help us discover and solve new problems, that’s “going to require AI systems that are capable of handling and dealing with uncertainty, new situations, novelty, [and] things well beyond their training examples,” said Mike Knoop, a cofounder of the automation software company Zapier and also of the ARC-AGI Prize, a $1 million competition awarding prizes to teams that excel at the Abstraction and Reasoning Corpus test.
It’s also important to note that the process of reasoning for machines may be very different than it is for humans. It is “certainly possible” that artificial intelligence will end up performing reasoning tasks in very different ways than humans do, said Tobias Gerstenberg, an assistant professor of psychology at Stanford University.
While attempts to build software that can reason stretch back to the earliest days of artificial intelligence research, today’s large language models, or LLMs, have surprised many with their ability to work through complex problems. “I was very surprised to find when I first experimented with ChatGPT that if you type in an undergraduate-level problem in calculus, it would solve it almost 100 percent of the time,” said Professor Yang-Hui He, a mathematical physicist who works with AI tools and is a fellow at the London Institute of Mathematical Sciences. Yet, he pointed out, these sorts of GPT models haven’t learned to do math by breaking a problem down to its distinct components and taking a methodical step-by-step approach. Rather, they have seen hundreds or thousands of math problems from the internet and identified patterns that allow them to predict what an answer should be with unnerving accuracy.
LLMs have also shown themselves to be capable of non-math tasks that require reasoning. They can, for instance, solve cipher puzzles that require letters to be shifted a particular distance in the alphabet, or answer questions such as “If Alice's father has five daughters and four of them are called A, B, C and D, what is the name of the fifth daughter?” (It’s Alice, by the way.) Yet a 2023 study by researchers from Princeton University showed that LLMs performed far better at solving these kinds of problems when the underlying data — say, the sentence a model needed to decode using a cipher — was commonly found online and therefore included in its training set. GPT-4’s accuracy at decoding a shift cipher, for instance, was 51 percent when the answer was a high-probability sequence of words and just 13 percent when it was a low-probability sequence. The ability of this kind of AI to reason, then, remains wholly dependent on what it has seen in the past.
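For reference, a shift cipher of the sort used in that study simply moves every letter a fixed number of positions along the alphabet. The sketch below is illustrative only; it is not the Princeton researchers’ code, and the function name and the 13-letter shift are arbitrary choices.

```python
# A simple shift (Caesar) cipher: each letter moves a fixed number of
# positions along the alphabet, wrapping around at "z".
# Illustrative only; not the code used in the Princeton study.

def shift_cipher(text: str, shift: int) -> str:
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)   # leave spaces and punctuation alone
    return "".join(result)

encoded = shift_cipher("stay here", 13)   # -> "fgnl urer"
decoded = shift_cipher(encoded, -13)      # -> "stay here"
print(encoded, "|", decoded)
```

The Princeton finding, in essence, is that GPT-4 decodes a string like this far more reliably when the hidden text is a common phrase than when it is an unusual one.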
The biggest improvement in five years
OpenAI’s o1 model is an explicit attempt to improve the reasoning performance of large language models. The company has provided little technical explanation of how o1 works, and it didn’t respond to an interview request for this newsletter, but there are a couple of key differences between o1 and its GPT stablemates. First, it uses an approach known as “chain of thought” to break problems down into smaller chunks that it solves sequentially. This is an approach people have used informally with GPT models by adding instructions such as “explain how you decided on each step of your answer” to encourage more accurate responses. Second, o1 is trained using reinforcement learning — a process in which the model is rewarded for providing correct answers and penalized if it provides incorrect ones — with the rewards reflecting how “well-reasoned” a response is.
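To make the “chain of thought” idea concrete, here is a minimal sketch of the informal version users have applied to GPT models: appending an instruction that asks the model to lay out its intermediate steps. It assumes the OpenAI Python SDK (version 1.0 or later), an API key in the environment and an illustrative model name; o1 builds this behavior, plus the reinforcement learning step, into the model itself rather than relying on the prompt.

```python
# Informal "chain of thought" prompting, as users have done with GPT models:
# append an instruction asking the model to lay out its intermediate steps.
# Sketch assumes the OpenAI Python SDK (openai>=1.0), an API key in the
# environment, and an illustrative model name.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "If Alice's father has five daughters and four of them are called "
    "A, B, C and D, what is the name of the fifth daughter?"
)
cot_prompt = (
    question
    + "\n\nThink through the problem step by step, explain how you decided "
      "on each step, and only then state your final answer."
)

response = client.chat.completions.create(
    model="gpt-4o",                      # illustrative model name
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
```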
The results are a significant step forward from the reasoning capabilities of OpenAI’s previous models. As previously described, it outperformed PhDs on the Graduate-Level Google-Proof Q&A Benchmark. It also solved 83 percent of problems on a qualifying exam for the International Mathematics Olympiad, compared to GPT-4o, the most advanced of OpenAI’s other models, which solved just 13 percent. Many scientists who have used the tool are impressed by the software, according to Nature. “It is the biggest genuine improvement in generalization available in a commercial AI model, going all the way back to GPT-2, which came out in 2019,” said Knoop. (He said this even though o1 struggled on the ARC-AGI test his prize is based on.)
Yet o1’s low score on ARC-AGI shows that it’s not a universal reasoning engine. There are, of course, potential issues with any benchmarking approach: Just as there is no universal test for human intelligence, neither is there a universal test for AI reasoning. Still, such benchmarks serve a purpose. “From an advancement standpoint, you always want to be setting marks that are just outside the bounds of capability,” said Knoop. “What's the next easiest thing that you know these computer systems can't do? That would move us in the direction of general intelligence.”
The o1 model is also significantly more expensive to use, and likely more computationally demanding, than its stablemates. While OpenAI has not shared details about the computational costs of training or running o1, the prices it charges users are considerably higher than those for GPT-4. Some users have reported that tasks can be ten times more expensive when executed using o1, a factor likely to limit the model’s use to tasks that genuinely require more complex reasoning.
There are many potential paths to imparting better reasoning to machines. OpenAI, judging by the name it gave its o1 model, seems to be suggesting that new iterations are forthcoming (and the company has said that elements of the o1 architecture will likely be used in its next big model, GPT-5), but there are also completely different approaches under consideration. Two of the most noteworthy are being developed by DeepMind: AlphaProof and AlphaGeometry, both designed to solve complex math problems. Rather than being trained to operate with words, as LLMs are, these systems work with formal, machine-readable representations of mathematics; AlphaProof, for instance, is built around a language called Lean. That represents a different approach to machine reasoning: the use of non-text-based knowledge to tackle problems. Under the hood, these systems do two other things that Knoop thinks will be particularly important to building better reasoning AIs in the future: They search through possible solutions to determine correct paths of reasoning, and they assimilate correct answers back into the model, improving its base of knowledge for future problem-solving.
Professor He points out that mathematicians are racing to translate all of the world’s mathematical proofs into Lean, in a process known as formalization, but it is a painstaking task that will take years; even a proof of the Pythagorean Theorem requires 100,000 lines of code. If a platform like AlphaProof could one day ingest the entire corpus of mathematical knowledge, though, it would most likely be able to reason methodically through any mathematical question, making it a powerful, if highly domain-specific, tool.
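For a flavor of what formalization looks like, here is a minimal Lean 4 sketch. It is illustrative only, not drawn from AlphaProof or from any existing formalization project: every statement is written in a machine-readable language, and Lean checks each proof mechanically.

```lean
-- Illustrative Lean 4 snippets (not from AlphaProof or an existing
-- formalization project): statements are machine-readable, and each
-- proof is verified by Lean's kernel.

-- Reuse a lemma that already exists in Lean's standard library.
theorem add_comm_example (m n : Nat) : m + n = n + m :=
  Nat.add_comm m n

-- A concrete Pythagorean triple, checked by computation.
example : 3 ^ 2 + 4 ^ 2 = 5 ^ 2 := by
  decide
```

Scaling this from toy statements like these to the whole of mathematics is the years-long task He describes.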
The impending AI mashup
Ultimately, though, building an artificial intelligence system that can reliably reason might require an entirely new approach — one that combines LLMs’ increasing ability to generalize with more specific tools, such as those that are part of AlphaProof.
Gary Marcus, a cognitive scientist and AI entrepreneur, has previously argued that “we cannot construct rich cognitive models in an adequate, automated way without the triumvirate of hybrid architecture, rich prior knowledge, and sophisticated techniques for reasoning.” Knoop and Conrad Wolfram, the European CEO of Wolfram Research, which builds software and AI to perform mathematics and computation, agree. “The hybrid approach seems to us important,” Wolfram said, pointing out that the probabilistic nature of LLMs makes them well-suited to “fuzzy” problems, while more rules-based approaches built on math could provide more efficient and accurate answers to computational problems.
But of course the means won’t matter if the results are wrong. “In the end, I think you want decisions and answers that are the right ones,” said Wolfram. Like the answer to the question about sperm that we started with: The main cause of the resulting zygote mortality would be “epistatic interactions between the genes of different species.”
Listen To Our Podcast
Learn about the past, present and future of artificial intelligence on our latest podcast, Humans vs Machines with Gary Marcus.
Advances That Matter
Rendering of the Hermes low-power demonstration reactor under construction in Oak Ridge, Tennessee. Image courtesy of Kairos Power
Small modular nuclear reactors have a chance to prove themselves. Google, Amazon and the U.S. government all made significant commitments in the past month to developing and deploying small modular reactors, or SMRs, at a scale that could help transform the nuclear power industry in the U.S. SMRs generate up to 300 megawatts of power, or about a third of what a large nuclear reactor produces, and are somewhere between a tenth and a quarter of the physical size. They are also designed to be factory-assembled and delivered to site, theoretically making them easier and cheaper to manufacture and install. But the technology is still in its infancy; currently there are only two SMRs operating commercially anywhere in the world, in China and Russia. Critics argue that smaller reactors might not achieve the efficiencies of larger plants, but Amazon and Google — both desperate to find new, clean sources of energy to support their ever more power-hungry data centers — are pushing forward. (Data centers currently consume up to 4 percent of all the electricity generated annually in the U.S.) Amazon has announced that it is committing more than $500 million to the technology, an investment that includes developing four SMRs with public utilities in the Pacific Northwest, a separate investment in an SMR company called X-energy, and an agreement to explore the development of an SMR project in Virginia. Google, meanwhile, will buy power, for an undisclosed sum, from at least six SMRs built by another SMR startup, Kairos Power, providing 500 megawatts of generation capacity (enough to power up to 500,000 homes), with the first facility expected to come online by 2030. The U.S. government has also opened applications for up to $900 million in funding to support the deployment of SMRs across the U.S. This is potentially a make-or-break moment for the SMR industry, providing it with the capital and opportunity to finally demonstrate whether it is a commercially viable technology.
Stem cells can reverse Type 1 diabetes. In June 2023, a woman received a transplant of reprogrammed stem cells originally taken from her own body; she started producing insulin within three months of the procedure and, more than a year later, still does not require external sources of insulin to control her blood sugar levels. That work, published in the journal Cell, describes a first-of-its-kind trial undertaken by researchers at Tianjin First Central Hospital and Peking University in Beijing, both in China. The results offer a potential path toward a medication-free future for people with diabetes, a disease estimated to affect almost 500 million people worldwide; the approach could be particularly useful for those with Type 1 diabetes. The process involved removing fat cells from the patient, inducing those cells into what's known as a pluripotent state, from which they can develop into any type of cell in the body, and then converting them into insulin-producing cells called islets. The islets were then tested for safety in animals before being injected into the abdominal muscles of the patient. Within two and a half months, the woman in the study was producing sufficient insulin to obviate the need for external doses, and her body has continued to produce adequate amounts of insulin for more than a year. Since the trial started, two more patients have received the same treatment, but no details about their responses have yet been published. Their experiences will be crucial to resolving one sticking point in the first study: The first patient was taking immunosuppressant drugs for a separate health issue, which could have protected her from a negative response to the injected cells. The approach also has some competition; other groups, including Vertex Pharmaceuticals, are experimenting with using donor stem cells to achieve similar results, an approach that would definitely require the use of immunosuppressants.
Yes, your diet does change your immune system. Researchers are steadily confirming that the food we eat has a profound impact on the way we respond to illness. This story from Nature rounds up the early evidence from an emerging field of research, enabled by the fact that scientists can now analyze the body’s often rapid response to changes in diet more accurately than ever before. A trial in mice, followed by a small trial in humans, showed that when subjects cycled between a low-fiber/high-fat diet and a normal diet for three days each, the switch to the low-fiber/high-fat diet suppressed the production and function of some T cells, the blood cells that fight infection and disease. Another trial undertaken this year observed 20 adults who had agreed to a four-week hospital stay during which they received strictly controlled diets: a low-carb ketogenic diet for two weeks followed by a low-fat vegan diet for two weeks, or vice versa. Both diets produced measurable changes in the immune system. The vegan diet heightened the body’s innate immune responses, which are the immediate first defense against pathogens; the ketogenic diet boosted levels of T and B white blood cells, which identify new viruses or bacteria and develop responses to fight them off. Other studies have looked at the benefits of specific foods or the impact of fasting, and many of them reveal an observable impact on immunity. It is early days in the field of diet-related immunity research — which is particularly difficult to conduct outside of controlled settings because it is so hard to track exactly what people eat — but the evidence suggests that further study of the connection between diet and immunity will be worthwhile.
Magazine and Journal Articles Worthy of Your Time
AI has dreamt up a blizzard of new proteins. Do any of them actually work? from Nature
1,800 words, or about 8 minutes
Hot on the heels of half a Nobel Prize being awarded to Demis Hassabis and John Jumper for their work on AlphaFold2, a tool that predicts the 3D structures of proteins from their amino acid sequences, comes this story from Nature probing how that tool is actually being used. The understanding and design of proteins have been made far more accessible in recent years through the development of AI tools like the ones created by Hassabis and Jumper, and in theory that might lead to an explosion in innovation that allows labs to identify and test new proteins much faster. What has happened instead is the design of hundreds or thousands of potential new proteins for some applications, each of which must be assessed for potential utility — a time-consuming task researchers were not prepared for. It’s not exactly one step forward and two back, but it is a reminder that so-called game-changing tools can bring their own complexities to the process of innovation.
The weeds are winning, from MIT Technology Review
2,800 words, or about 11 minutes
According to the International Herbicide-Resistant Weed Database, there are now 273 species of herbicide-resistant weeds. And some of them are, quite literally, a huge problem for farmers. One particularly pernicious species, Palmer amaranth, grows as much as two inches per day to a height of eight feet, and can take over entire fields, reducing crop yields to zero in some cases. This article from MIT Technology Review explains that the path forward will have to be multi-pronged. While there is potential to use genetically targeted weed killers that affect only specific plants, there are other technological solutions available too: sterilizing pollen to halt the spread of weeds, drone-informed targeted herbicide spraying and AI-powered robots that spot weeds and zap them with lasers. The problem is that these approaches are currently expensive, forcing farmers to think long-term to justify the investment. The alternative is spraying weed killer every year, which exacerbates the problem over time.
Why Is It So Hard to Go Back to the Moon? from Scientific American
3,400 words, or about 14 minutes
It's been more than 50 years since humans set foot on the Moon. That could soon change: NASA’s Artemis II mission, scheduled for 2025, will send astronauts around the Moon, with future missions planned to actually land on the surface. Why has it taken so long? This story digs into the many factors at play: funding for NASA is dramatically lower than it used to be, making the annual budgets for these missions tighter; safety standards have improved, meaning engineers must work harder to simulate possible scenarios and perform more tests; technologies have become more complex, making manufacturing and design more difficult; and we’re no longer in a Cold War-fueled space race, so there is less urgency. All told, it’s a combination of factors that has dramatically slowed NASA down. And, as the article points out, other public and government infrastructure projects around the U.S., from refurbishing subways to building aircraft carriers, face many of the same problems.