ChatGPT, Gemini, Copilot, and other AI tools can create impressive sentences and paragraphs from a simple, one-line text prompt. These words are generated by large language models that were trained on vast amounts of human-written text and data scraped from the internet. But now that generative AI tools produce such a large volume of synthetic content, that content is being used to train future generations of these AIs. Researchers warn that if this continues unchecked, it could have disastrous consequences.

Training large language models on their own data could lead to model collapse, according to a recent study by University of Oxford computer scientist Ilia Shumailov and his colleagues published in Nature. Model collapse sounds alarming, but it doesn't mean generative AIs would stop working. Instead, their responses would increasingly deviate from the original training data. Although sometimes biased, this original data is a reasonable representation of reality. However, as the tools train on their own generated data, the small errors they make accumulate, causing their content to lose the nuance of diverse perspectives and eventually turn into gibberish.

Shumailov and his team conducted an experiment using a pretrained language model called OPT-125m, fine-tuning it with Wikipedia articles. They then gave the tool a text prompt and asked it to predict the next words. The response was fed back into the model for further fine-tuning. When each successive generation was trained with data generated by the previous one, they found that by the ninth generation the model was producing nonsense: a prompt about 14th-century architecture ended up as a list of types of jackrabbits. In another set of experiments, in which some of the original human-written data was retained during training, the model degraded only slightly.
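The study's exact training pipeline isn't reproduced here, but the loop it describes, generate text with the current model and then fine-tune the next generation on that text, can be sketched with the Hugging Face transformers and datasets libraries. The seed texts, generation settings, and training hyperparameters below are illustrative placeholders, not the paper's configuration.

```python
# Minimal sketch of recursive fine-tuning; settings are illustrative,
# not the Nature study's configuration.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Stand-ins for the Wikipedia articles used as generation-zero training data.
corpus = [
    "The cathedral, begun in the fourteenth century, combines Gothic vaulting with local stone.",
    "Towers of this period are noted for their pierced parapets and ornate pinnacles.",
    "Perpendicular architecture emphasized vertical lines and large traceried windows.",
]
prompts = [text[:60] for text in corpus]

def fine_tune(model, texts, output_dir):
    """Run one generation of causal-LM fine-tuning on the given texts."""
    ds = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=2, report_to=[])
    Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
    return model

def generate_corpus(model, prompts, max_new_tokens=64):
    """Sample continuations from the current model to train the next generation on."""
    texts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens)
        texts.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return texts

for generation in range(9):  # the study saw gibberish by the ninth generation
    model = fine_tune(model, corpus, output_dir=f"gen_{generation}")
    corpus = generate_corpus(model, prompts)  # the next generation trains on this output
```

Each pass replaces the training corpus with the model's own sampled continuations, which is exactly the feedback loop the researchers warn about.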

This study highlights the serious implications of training AI on its own responses, including exacerbating bias and turning text into nonsense if left unchecked. While big AI companies have methods to prevent this type of collapse, the growing number of people using language models to train their own chatbots and other AIs means the problem could still take hold.

Language models and generative AI have been around for decades, primarily in computer science labs. However, the rise of chatbots is more recent, starting with the public release of ChatGPT in November 2022. Advances in hardware that can process information in parallel, the advent of the transformer neural network, and the availability of trillions of high-quality, human-created data points have been key to this recent rise.

Shumailov explains that model collapse amounts to a decline in data quality, in both what a model takes in and what it puts out. He uses the example of teaching a computer program what a cat is, noting that such extrapolation comes with subtle errors. The process is akin to a game of telephone, in which a phrase is whispered from one person to the next until it reaches the last person, often arriving badly mangled because of errors introduced along the way. These accumulating errors cause language models to hallucinate, generating plausible but incorrect content.

Model collapse essentially refers to a shift away from the original text used to train the models, according to Leqi Liu, an AI researcher at the University of Texas at Austin. One reason for this is the disappearance of data distribution tails—text representing low-probability events. For example, a model might become very good at describing furry cats but fail to retain information about hairless ones. Similarly, text from minority groups may appear less frequently, sidelining data related to marginalized people.
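Liu's furry-versus-hairless example can be made concrete with a toy simulation: a hypothetical model that only knows category frequencies is repeatedly refit to finite samples of its own output, and once a rare category fails to appear in a sample, the refit model assigns it zero probability forever. The probabilities and sample size below are made-up numbers for illustration, not figures from the study.

```python
# Toy sketch of how low-probability events drop out when a model is
# repeatedly refit to its own samples. Hypothetical numbers only.
import numpy as np

rng = np.random.default_rng(42)
categories = ["furry", "hairless"]
probs = np.array([0.99, 0.01])   # generation 0: the real data distribution
n_samples = 200                  # each generation is trained on a finite sample

for generation in range(1, 31):
    sample = rng.choice(len(categories), size=n_samples, p=probs)
    counts = np.bincount(sample, minlength=len(categories))
    probs = counts / counts.sum()  # the next model only knows what it saw
    if probs[1] == 0.0:
        print(f"generation {generation}: 'hairless' has vanished from the model")
        break
    print(f"generation {generation}: P(hairless) = {probs[1]:.3f}")
```

Once the rare category draws zero samples in any generation, no later generation can recover it, which is the tail-loss behavior Liu describes.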

To prevent AIs from increasing bias or breaking down and producing gibberish, it is crucial to monitor all data and ensure that both prior knowledge (human-generated text) and new knowledge (AI-generated text) are used for training. Another approach could be explicitly capturing the tail of the distribution, such as including information about hairless cats. Companies marketing AI tools heavily check for data drift, so any issues would likely be noticed early and fixed. However, individuals building models on a smaller scale would be more affected and need to be aware of the risk.
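One way to read that advice in code: rather than retraining only on model output, keep a reserved pool of human-written text and mix a fixed share of it into every generation's training set. The function below is a minimal sketch of that bookkeeping; the 50/50 split and the helper name are illustrative choices, not recommendations from the researchers.

```python
# Sketch of mixing preserved human-written text into each generation's
# training data; the ratio here is an illustrative assumption.
import random

def build_training_set(human_texts, model_texts, human_fraction=0.5, size=10_000):
    """Assemble a training corpus that always keeps a fixed share of human data."""
    n_human = int(size * human_fraction)
    corpus = (random.choices(human_texts, k=n_human) +
              random.choices(model_texts, k=size - n_human))
    random.shuffle(corpus)  # interleave human and synthetic examples
    return corpus
```

Keeping the human pool fixed means the prior knowledge never degrades, while the synthetic share can be tuned and monitored for the kind of data drift the companies watch for.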

Source: https://www.sciencenews.org