Newsletter / Issue No. 12

Image by Ian Lyman/Midjourney.


May 2024


Dear Aventine Readers, 

Another month, another look at how artificial intelligence is pushing limits we never knew we had. This time, it’s the amount of useful data in the world for training AI systems. Did you think there was an inexhaustible supply? There is not. We look at the types of data facing the greatest shortages and what companies are doing to compensate. 

Also in this issue: Graphene, a material that could potentially transform multiple industries, is getting renewed attention 20 years after it was discovered; a new way to design buildings to prevent catastrophic collapse during earthquakes; and a look at the crews who take care of all those cables crisscrossing the ocean floor. 

Thanks for reading! 

Danielle Mattoon
Executive Director, Aventine


The Big Idea

AI Faces an Information Shortage. Fake Data May Help.

There was a time when it seemed we were surrounded by an infinite amount of data. Large language AI models have turned that assumption on its head.

OpenAI’s first transformer-based large language model — GPT-1, released in 2018 — was trained on about 1 billion words, or roughly 10,000 novels. Three generations later, GPT-4 is speculated to have been trained on 4.6 trillion words, or the equivalent of about 46 million novels.

Epoch AI, an AI research institute, estimates that the data sets used to train today’s large language models, or LLMs, double in size roughly every 10 months. In 2022, its researchers made a rough prediction about what that growth would mean for the supply of high-quality data that companies rely on for such training. The resulting paper suggested the stock of high-quality language data would be exhausted “likely before 2026.”
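
To see why a doubling time matters so much, here is a minimal back-of-envelope sketch in Python. The 10-month doubling time is Epoch AI’s estimate quoted above; the current dataset size and the total stock of usable text are illustrative placeholders, not figures from the paper.

```python
import math

# Back-of-envelope projection of when training-data demand could exhaust
# the available stock of high-quality text.
# Assumptions (illustrative placeholders, not Epoch AI's actual inputs):
#   - today's largest training sets are on the order of 5e12 words
#     (roughly the GPT-4-scale figure cited above),
#   - the total stock of usable high-quality text is 100x that,
#   - dataset sizes double every 10 months (Epoch AI's growth estimate).
current_dataset_words = 5e12
stock_words = 100 * current_dataset_words
doubling_time_months = 10

# Solve current * 2^(t / doubling_time) = stock for t.
months_to_exhaustion = doubling_time_months * math.log2(stock_words / current_dataset_words)
print(f"Stock exhausted in roughly {months_to_exhaustion / 12:.1f} years at this growth rate")
# log2(100) is about 6.6 doublings, so even a 100-fold reserve lasts only
# about five and a half years at this pace.
```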

This predicament has attracted the full attention of big AI companies, which have been devising ever more creative — and questionable — ways to hoover up data, according to The New York Times. As DeepMind CEO Demis Hassabis recently told Wired, “there are still gains to be made” by increasing the size of models and the data sets used to train them.

That raises an existential question for AI: Is there enough data to keep up its current rate of progress?

To try to answer that, Aventine spoke with data and AI experts, who said, in so many words, that the idea of AI running out of data in general is too broad a notion to mean much. The real concern, for companies of all sizes, is running out of specific types of data for specific use cases. So it’s better to think about particular AI models and the kinds of data they require, whether that’s large language models and their need for all kinds of written language or industrial AI systems that need more structured information gathered in huge tables.

Then there’s the question of whether the data is usable. In the case of large language models, there’s “a lot of concern around the quality of the data,” said Kalyan Veeramachaneni, a principal research scientist in MIT’s Laboratory for Information and Decision Systems and cofounder of the synthetic data startup DataCebo. Historically, texts that have been carefully and deliberately written and edited — books, academic papers, newspaper articles, Wikipedia pages and so on — have typically been considered to be high quality. “Text is most valuable when it communicates new knowledge, illustrates rare skills, or is written in a low-resource language” for which there is a smaller body of written text, wrote Riley Goodside, staff prompt engineer at the synthetic data startup Scale AI, in an email to Aventine. 

It is the finite amount of this kind of data that is the biggest concern for companies building large language models. That said, the situation may not be quite so desperate as Epoch AI first predicted. On reflection, said Pablo Villalobos, a staff researcher at Epoch AI who was lead author of that study, the 2026 estimate was overly aggressive because the underlying assumption of what counts as high-quality data was too narrow. Instead, he said, AI developers have used broader swaths of data collected from the web to train their models, and “it turns out it works mostly fine.” Now, based on research that Villalobos and others at Epoch AI are currently working on for a follow-up to the original paper, he estimates that at the current rate of progress, high-quality data might be exhausted sometime between 2028 and 2030. “There's probably much more high quality than we thought,” he said. “But still not enough to last forever.”

Regardless of the exact date, the fact that a date exists helps explain why companies like Google and OpenAI, according to The New York Times, are looking to alternative sources for data. Most large language models are making use of Common Crawl, for example, a regularly archived copy of billions of web pages, even though the bulk of the data is unusable. (OpenAI took 45 terabytes from Common Crawl to train GPT-3, but used only about 1 percent of it after low-quality content was filtered out.) Google and OpenAI are exploring other data sources as well, from transcribing YouTube videos to scouring restaurant reviews on Google Maps. OpenAI is also busy striking multimillion-dollar deals for access to content with publishers such as the Financial Times and websites like Reddit, something “they wouldn't do if data was very plentiful,” said Villalobos. 

Yet Villalobos warns that for companies at the frontier of foundation model development, there are other constraints — such as energy availability and the supply of computing hardware — that will start to limit progress on a similar timescale to data bottlenecks. Unless new power plants and chip production facilities are built fast enough, he argues, by the turn of the decade these companies may not be able to make use of larger datasets even if they have them.

Data, data everywhere, and not a drop to use

Away from the bleeding edge of AI development, there are other, smaller — but perhaps just as pressing — data shortage concerns: Businesses around the world are struggling with huge data bottlenecks when they attempt to use AI inside their organizations, often finding that they simply can’t source enough quality data to effectively train their own models. “Absolutely it's a concern,” said Kristina Podnar, senior policy director at the Data and Trust Alliance, an industry body focused on the adoption of responsible data and AI practices, whose members include American Express, CVS Health, GM, Nike, Walmart and others.

As artificial intelligence has become more accessible, in large part due to the rise of LLMs, companies are becoming impatient to make use of AI to automate tasks and processes. And most are sitting on vast troves of data that are theoretically ideal for use in training models. Guang Cheng, a professor of statistics and data science at the University of California, Los Angeles, points out that industries such as healthcare and finance in particular have huge structured datasets that describe in great detail the behavior and actions of their customers. Much of this is also held as tabular data, organized into rows and columns, that makes it particularly well suited to training some forms of AI. “I think one table is sometimes far more important than a million pictures,” said Cheng.

Yet it’s often difficult to take advantage of these huge datasets. One of the most significant challenges is that much of the data held in these records is highly sensitive and often regulated — whether that’s medical records, financial transaction histories or personal data protected by the EU’s General Data Protection Regulation. This severely limits how widely data can be used within an organization, explains Veeramachaneni, requiring onerous sign-off processes that cause long delays in building and training models. Privacy issues also present a huge problem if the data is being used to train generative AI because the resulting models can easily disclose information from the original dataset when they generate new content.

Veeramachaneni also points out that many of the most valuable features in a company’s datasets are the rarest: cases of fraud in a bank’s records, say, or the failure of a small component deep inside a wind turbine. These infrequent occurrences, or edge cases, are often most useful for training AI models so that they can spot potential problems in live data streams. The fact that such occurrences are so rare makes it difficult for algorithms to learn how to reliably identify them. 

The rise of synthetic data

To overcome some of these issues, many companies — both those building cutting-edge large language models and those seeking to apply AI in their businesses — are turning to synthetic data. This is data generated by software to closely mimic real data while offering advantages that real data lacks: it can omit personal information, incorporate adjustments that reduce bias or include more instances of rare events.

Synthetic data is not a new concept in AI; it’s been used since at least the 1980s, when synthetic images were used to train early versions of self-driving vehicles. But as AI has exploded in capability and availability, synthetic data has begun to move from experimental technology to commercial reality. The market for synthetic data is expected to grow from $380 million in 2022 to $2.1 billion in 2028.

Not all synthetic data is created in the same way. Some is built using relatively straightforward and explainable mathematical approaches, explained Adam Kamor, cofounder and head of engineering at the synthetic data startup Tonic.ai. For instance, probabilistic models can be used to create new data that follows the same statistical patterns as the original dataset while stripping out personally identifying information. More complex synthetic data sets are often generated using an AI model that is trained on the original dataset, explained Veeramachaneni, and can then be adjusted to generate datasets that contain more or fewer instances of specific features. Widespread access to large language models also means it is now possible to quickly create unstructured text data, or some form of written language. LLMs can be used to create datasets that don’t exist and would be time- and labor-intensive to collect and clean from scratch — say, a list of tweets written in German about a series of companies. Alongside these generation techniques, there are tools, such as those made by Tonic.ai, that claim to be able to strip personal information out of existing datasets.
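
As a rough illustration of the simpler, statistical approach Kamor describes, the sketch below fits per-column means and the covariance between columns of a small table, then samples new rows from a multivariate normal distribution. It is a minimal sketch under assumed data, not Tonic.ai’s or DataCebo’s actual pipeline: the column names and values are invented, and identifying columns are simply dropped rather than transformed.

```python
import numpy as np
import pandas as pd

# Toy "real" dataset; names and values are invented for illustration.
real = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],     # identifying column -> dropped
    "age":         [34, 51, 29, 42, 60],
    "balance":     [1200.0, 8400.0, 300.0, 5600.0, 15200.0],
    "num_tx":      [14, 52, 7, 31, 88],
})

def synthesize(df: pd.DataFrame, n_rows: int, drop_cols=("customer_id",),
               seed: int = 0) -> pd.DataFrame:
    """Sample synthetic rows that roughly preserve each column's mean/spread
    and the correlations between columns, while omitting identifying columns."""
    rng = np.random.default_rng(seed)
    numeric = df.drop(columns=list(drop_cols))
    mean = numeric.mean().to_numpy()
    cov = np.cov(numeric.to_numpy(), rowvar=False)   # captures correlations
    samples = rng.multivariate_normal(mean, cov, size=n_rows)
    synth = pd.DataFrame(samples, columns=numeric.columns)
    return synth.clip(lower=0)  # crude guard against impossible negative values

synthetic = synthesize(real, n_rows=1000)
print(synthetic.describe())          # means and spreads roughly match the original
print(synthetic.corr().round(2))     # correlation structure is preserved
```

A real product would also need to handle non-numeric columns, match marginal distributions more faithfully and add formal privacy guarantees; the point here is only that the synthetic rows keep the aggregate statistical patterns of the original table without reproducing any individual record.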

Using synthetic data to train cutting-edge large language models is more challenging. While it is now easy to create vast quantities of new text using generative AI, caution needs to be exercised if that text is to be used to further train an AI model. This is because all of the information embedded in text newly generated by an LLM is already encoded within the model itself, explains Villalobos. In fact, a team of AI researchers including academics from the Universities of Oxford and Cambridge has shown that this approach can degrade the quality of a large language model and in some cases lead to catastrophic “model collapse,” in which the algorithm begins to produce nonsense. This happens because over time the process dilutes the information-rich edge cases originally encoded in the model. 
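
The failure mode is easy to reproduce in miniature. The toy simulation below is an illustration of the general phenomenon, not the Oxford and Cambridge team’s actual experiment: it repeatedly fits a simple Gaussian model to samples drawn from the previous generation’s fit, and the fitted spread tends to shrink generation after generation, with the rare tail values disappearing first.

```python
import numpy as np

# Toy illustration of "model collapse": fit a model to data, then replace
# the data with samples drawn from that model, so each generation trains
# only on the previous generation's output. A deliberately small sample
# size makes the degradation visible within a few dozen generations.
rng = np.random.default_rng(7)
n_samples, n_generations = 20, 100

data = rng.normal(loc=0.0, scale=1.0, size=n_samples)  # the original "real" data
for gen in range(1, n_generations + 1):
    mu, sigma = data.mean(), data.std()      # "train" this generation's model
    data = rng.normal(mu, sigma, n_samples)  # next generation's "training data"
    if gen % 20 == 0:
        print(f"generation {gen:3d}: fitted sigma = {sigma:.4f}")
# The estimated spread tends to drift toward zero, so the informative tail
# values of the original distribution are the first thing the chain loses.
```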

“Synthetic output from an LLM certainly shouldn’t be used indiscriminately to continue training the same model,” Scale AI’s Goodside wrote in an email. But he added that under the right circumstances — in controlled volumes and with human expert review — synthetic data can provide a much-needed boost to performance. “Synthetic data can be, and is, used to improve the quality of models,” he said.

Finding the right balance

There are potential headaches for any company looking to employ synthetic data, though, warned Ian Coe, cofounder and CEO of Tonic.ai. One is that it can be difficult to guarantee that a synthetic dataset is free from all personally identifiable data, because some of the machine learning approaches used to generate it can inadvertently internalize information. A related issue is that, according to Coe, synthetic data must make a “trade-off between data privacy and data utility,” and finding that balance — where the synthetic data is useful without compromising privacy — can be difficult.

“For many tasks, the diversity, domain knowledge, and creativity of human experts needed to train frontier LLMs can’t be automated,” wrote Goodside. “There are also cases where the best use of that expertise is in providing feedback, curating synthetic attempts, or writing ideal demonstrations in collaboration with an LLM.”

Still, for companies using AI and the makers of large language models alike, synthetic data offers a potential escape from many of the headaches associated with real data, whether that’s privacy or simply sourcing enough information to build a genuinely useful model. Whether or not it’s the best choice for any organization must be determined on a case-by-case basis, though — a risk-reward calculation that weighs a range of concerns including utility, cost, speed and the risk of something going wrong. “Ultimately, every time that an organization is using data, it's putting its reputation on the line,” said Podnar.

Listen To Our Podcast

Learn about the past, present and future of artificial intelligence on our latest podcast, Humans vs Machines with Gary Marcus.

Quantum Leaps

Advances That Matter

The prototype of the ANYmal quadruped robot. Courtesy Robotic Systems Lab, ETH Zürich

Weird and wonderful new ways for robots with legs to move. For years, researchers have been trying to build robots that can walk as well as animals and they have made impressive progress: One roboticist recently told Aventine that getting a robot to walk like a human was almost at the point of being a “solved problem.” But one of the benefits of building robots is that they don’t have to conform to the norms of biology, and in a special edition of Science Robotics, published at the end of April, researchers describe a series of new designs for legged robots that help them navigate the world in decidedly nonhuman ways. How about a one-legged robot that hops while being supported by a quadcopter? Or a quadruped robot with wheels at the end of its limbs, to increase efficiency while still enabling it to traverse obstacles such as stairs? Some of the research is undeniably esoteric. But certain areas of inquiry, like exploring new approaches to teaching human-shaped robots to play soccer, underscore the fact that, while walking may be a solved problem for human-inspired machines, there are still plenty of advances to be made in robotic locomotion.

The hyperrealistic avatars of our dreams — and nightmares — are here. Picture the scene: It’s the end of the year and you want to present the highlights of the last 12 months to your team. So you point an AI bot at some documents and it not only writes your script but also feeds the resulting text into a hyperrealistic avatar of your head and shoulders, enabling the avatar to deliver the presentation in your voice, with your gestures. This is pretty much what’s on offer with a new product launched at the end of April by the London-based company Synthesia, according to MIT Technology Review. The company uses motion capture, cameras and audio recordings to capture a person’s visual and voice likeness and combines that information with AI models trained on at least 1,000 professional actors. The resulting avatars are then animated using text, which can itself be generated using LLM tools. So will this be a cool new time-saver, or a tool for identity hacking? Given that 10 percent of Synthesia’s staff is dedicated to moderating the content being created on its platform, the company is obviously aware of the potential dangers posed by its tools if they are abused. Still, with corporate deepfake scams on the rise, the existence of ever more sophisticated technology to replicate us is another reason to be doubly vigilant about verifying the legitimacy of anything you see on video.


Researchers have made the most detailed map of brain tissue yet. A cubic millimeter of brain tissue doesn’t sound like much. And in one sense, it isn’t: It’s just one-millionth of the brain’s entire volume. But inside that speck of tissue are more than 50,000 cells and 150 million synapses. And now, thanks to researchers from Harvard and Google who have published their work in Science, that little piece of tissue has been rendered in exquisite detail, making it the most detailed view inside the human brain yet created. The process involved cutting the sample from the cortex of a 45-year-old woman during surgery, then slicing it into segments just 34 nanometers thick. Those were then preserved, stained and imaged using electron microscopes. Finally, AI was used to stitch the 1.4 petabytes of underlying data together into a 3D model, which researchers are now poring over to hunt for previously undiscovered features in the brain. Though there is little prospect of building such a detailed model of an entire human brain in the near future, the work is one of many strands of research aiming to better understand the brain’s fundamental underpinnings. Just last year, an international team of researchers mapped and identified 3,000 types of brain cells, most of which were never before understood. Collectively, the advances are helping neuroscientists understand the fundamental structure and behavior of the brain in more detail than ever and could eventually lead to advances in treating brain-related disorders.

Four Ways to Think About

Graphene

A rendering of graphene, which in reality is one atom thick. Shutterstock

When graphene was discovered at the University of Manchester in 2004, it was heralded as one of the most promising innovations of the early 21st century and won the two researchers who discovered it, Andre Geim and Konstantin Novoselov, the Nobel Prize for Physics in 2010. Why the excitement? It was the world’s first known two-dimensional material (meaning it is made up of a single layer of atoms) and has a collection of properties that rarely exist together in the same material. It is the lightest man-made material that exists; it is hundreds of times stronger than steel; it’s highly flexible and also an extraordinary conductor of electricity. Finally, when layers of graphene are placed on top of one another at different angles, the properties of the material change, allowing for potential applications that are currently hard to imagine, such as the creation of successful and perhaps even affordable quantum computers. Combined, all of these qualities could be employed to accelerate innovation in almost every industry, ranging from computers to cars to construction — an intoxicating prospect.

Yet despite all of the excitement around its discovery, graphene has spent nearly two decades in research purgatory. Researchers could not make enough of the stuff even to study its properties effectively, let alone find ways to use it in practical, real-world applications. The original method of making graphene — sheets of which are sometimes used to make so-called graphene devices — involved applying tape to graphite and pulling off little flecks of material, a highly inefficient process that cannot be replicated at scale.

Only in the last few years has the production process changed, as technologies have finally improved enough for graphene to be studied more widely in labs. Practical applications are just now starting to follow. Only a few startups around the world are trying to make graphene for bulk applications (like adding it to concrete or rubber), and only a handful of laboratories make the material for research purposes. But that should all change in the near future, as the costs of creating graphene finally begin to come down. If investors can be persuaded to fund equipment to improve how labs create graphene, we might soon be able to make enough of the material to advance research beyond theoretical questions, experts told Aventine.

“Diamonds and graphite are both just carbon. But the different ways of assembling these materials gives you totally different properties. Graphite is very soft, diamonds are very hard, graphite is conducting, diamonds are insulating. Do you really know what’s going to happen when you put [layers of graphene] together and let them interact with each other? Graphene is one of the simplest materials and yet it is one of the richest materials to give you different phenomena in the same material depending on how many layers you have, and whether there are angles between layers. There’s something going on that I think can eventually be related to quantum computation, or quantum engineering. The sort of behavior that these layers of carbon exhibit could be a good starting point.”
— Long Ju, a condensed matter physicist at MIT studying novel quantum materials

“We [currently] make one or two [graphene] devices every few weeks. Ideally, we would want to be able to make one million devices in a single shot. The scalability is something which, on the more applied side, would have two big important consequences. One, we would be able to do a lot more basic science. We would be able to test way more things than we can right now.… If we were able to go much faster, then we would advance science much faster. We would be able also to actually make real technology and applications, not just pure science. Given how wonderful these materials are and what we can do with them, it is a bit of a shame that we cannot make many of them. I don’t think it's an extremely difficult problem … it’s just that this is all very new.… We need to have sufficient incentive to apply substantial resources to make it all happen.”
— Pablo Jarillo-Herrero, also a physicist at MIT known for “twistronics” and groundbreaking discoveries about the properties of graphene when it’s layered at different angles

“I helped to found this company, Universal Matter, that has just finished [its] pilot facility, and can produce one ton per day of graphene. That begins to really change the dynamic. I don’t know what their price is going to be … but their markets demand low-cost material. Their process is so much cheaper now, you can take waste material and turn it into graphene. So the prospects of using graphene now have gone dramatically up. What they are doing is they are selling a formulation that immediately pours into concrete, asphalt, into the rubber tires [to make them more durable and require less material]. And so that’s what they’ve worked out, working out the master batching for the thing that the consumer could just pour in.”
— James Tour, chemist at Rice University and cofounder of Universal Matter 

“To make these devices relevant in commercial applications, you do need to produce these at high volumes for low cost. People are greedy and they say, ‘I want something that is the same price or cheaper and consumes less power.’ So it needs to be produced at high volume and low cost, and that’s a huge challenge that requires research. These are not things you can do in a university lab, and that is my big concern. You have this amazing technology, so how do you actually get it to the market? Typically in the past these things have come from the government. The reason that we have very mature silicon chips is a lot of subsidies from the government. Investors could definitely invest in companies that are dedicated to this, and it should happen.”
— Frank Koppens, leader of the quantum nano-optoelectronics group at the Institute of Photonic Sciences, focused on developing graphene and other 2D materials

Innovation on the Ground

Technology’s Impact Around the World

The segmentation and partial collapse of a model building. Nature magazine/open access

1. Valencia, Spain. Weak points in a building’s structure seem like a recipe for disaster if an earthquake hits. But a team of researchers from the Technical University of Valencia in Spain has shown that if weaknesses are designed into a building deliberately, they can help prevent catastrophic collapse in times of extreme stress. The researchers have drawn up plans for buildings in which specific parts of the structure are designed to fail early under relatively low levels of stress, isolating the damage so that under more extreme stress only part of the building collapses rather than the whole structure. These plans — based on computer modeling and inspired by historic civil engineering successes in which buildings avoided catastrophic failure — have been turned into reality in the form of a two-story, precast concrete building that was subjected to tests to investigate how it would fail. The results, published in Nature, bore out the modeling. The researchers are now extending their experiments to designs using cast-in-place concrete and steel to determine whether they fail in similar ways, and they hope the approach will inform future building designs in earthquake-prone regions.

2. Lahaina, Maui. On August 8, 2023, a wildfire tore through Lahaina on the Hawaiian island of Maui. By December, 101 people were known to have died as a result. Seventeen were identified by fingerprints, 14 via dental records, two as a result of their medical devices, and three because they died in hospitals. But, as this fascinating story from MIT Technology Review describes, 62 of the deceased were identified using rapid DNA analysis. First employed in a large-scale disaster during the 2018 Camp Fire in California, the technology is now gaining traction as a means of identifying individuals among large numbers of victims. Unlike regular DNA testing, which requires the analysis of 3 billion base pairs of the human genome, rapid DNA testing looks at 27 genome regions that contain distinctive repeating strings of DNA sequences that are shared by families and yet are almost impossible for unrelated people to have in common. Crucially, the tests can provide results in just two hours. This story from Lahaina describes how, in the aftermath of natural disasters or wars, the technique can bring closure to families far faster than older methods — and even in cases where closure could not have been provided at all in the past.

3. HR departments, everywhere. Recruiters are drowning in a flood of job applications and many of them seem to blame AI. Wired reports that some HR departments — particularly those managing hires in the technology sector, which has seen mass layoffs over the past year — are overwhelmed with the number of applicants. For those on the job market it’s now easier than ever to apply for hundreds of jobs by using tools such as LinkedIn’s Easy Apply feature, which sucks information out of a person’s LinkedIn profile and injects it into a company’s application. Applicants can also use tools like ChatGPT to write resumes or cover letters, or even to complete tests requested by recruiters as part of the hiring process. Meanwhile, AI tools built for recruiters can surface hundreds of potential candidates for roles the candidates are not suited for. The result, recruiters told the magazine, is an incredibly high volume of candidates, many of whom are the wrong fit for open roles — and a growing reluctance among many HR professionals to actually use the automations that were originally designed to make their lives easier.

Long Reads

Magazine and Journal Entries Worth Your Time

The incredible new tech that can recycle all plastics, forever, from New Scientist
2,300 words

One of life’s frustrations for even a casual environmentalist is that so many plastics can’t be recycled. The reason: Most plastic recycling is mechanical, which means the plastics are torn apart by a grinder or shredder, melted down, and then turned into pellets for others to convert into new products. The problem is, only certain plastics can be processed this way, and the process produces lower quality plastic with each round of recycling. This story points to new chemical processes that use very high temperatures to melt — or special solvents to dissolve — plastics into their constituent chemicals. In theory, this means that all plastics could be processed together, their chemical constituents separated in much the same way that crude oil is distilled, allowing for new, virgin plastics to be created from there. There are roadblocks: The processes are energy intensive; ensuring that the resulting chemicals are pure is tricky; and scaling the processes to work at high volumes will take time. Still, it’s clear that there is a future in which far more of our plastic makes its way back into supply chains rather than landfills.

The cloud under the sea, from The Verge
9,200 words

Have you ever wondered about the thousands of miles of undersea cables that move data around the planet so we can work, Zoom and consume Netflix? The crews of the ships traveling the world’s oceans to maintain those cables certainly hope you don’t, as that will mean they’ve done their job properly. Yet as this eye-opening story explains, the workers aboard this collection of boats have terrifically tough jobs. They operate from a small, aging fleet of ships doing work that is technically difficult, painstaking and dangerous — working with “heavy metal hooks and high-tension lines that, if they snap, can cut a person in half.” A small number of companies have deals with cable operators to run the ships, but margins are thin given efforts by their clients to cut costs. New boats can cost $100 million and investors are reluctant to make such large capital commitments. This could well end up being short-sighted given how critical these boats and crews are for maintaining the ever-expanding network of cables crisscrossing our oceans.

How ECMO Is Redefining Death, from The New Yorker
4,000 words

Extracorporeal membrane oxygenation, or ECMO for short, isn’t new. The technique, which assumes the function of the heart and lungs outside the body by removing carbon dioxide from and adding oxygen to a patient’s blood, was first developed in the 1960s. But it is only in the last decade that the technology has come to be used extensively. It gained particular attention during the COVID-19 pandemic, helping keep patients alive when their lungs had been ravaged by the virus. Now, however, as this New Yorker story by a physician named Clayton Dalton describes, ECMO is creating its own new problems, prolonging life indefinitely — and perhaps unnaturally — for patients in intensive care units. Dalton has referred to the technology as a “bridge to nowhere” that — despite its huge potential for saving lives — raises profound ethical questions about how and when it should be used.
