• FaceDeer@fedia.io · 17 hours ago

    Betteridge’s law of headlines.

    Modern LLMs are trained using synthetic data, which is explicitly AI-generated. It’s done that way so the data’s format and content can be tailored to maximize its value in the training process. Over the past few years it’s become clear that simply dumping raw data from the Internet into LLM training isn’t a very good approach. It sufficed to bootstrap AI development, but we’re kind of past that point now.

    Even if there were a problem with training new AIs, that would just mean they won’t get better until the problem is overcome. It doesn’t mean they’ll perform “increasingly poorly,” because the old models still exist; you can just keep using those.

    But lots of people really don’t like AI and want to hear headlines saying it’s going to get worse or even go away, so this bait will get plenty of clicks and upvotes. Though I’ll give the body of the article credit: if you read more than halfway down, you’ll see it raises these sorts of issues itself.

    • NotSteve_@lemmy.ca · 7 hours ago

      Are there any articles about this? I believe you, but I’d like to read more about the synthetic training data.

      • FaceDeer@fedia.io · 6 hours ago

        Thanks for asking. My comment was off the top of my head, based on stuff I’ve read over the years, so first I did a little fact-checking of myself to make sure. There’s still a lot of black magic involved in training LLMs, so the exact mix of training data varies a lot depending on who you ask. In some cases raw data is still used for the initial pretraining of LLMs, to get them to the point where they’re capable of responding coherently to prompts, while synthetic data is more often used for the fine-tuning phase, where LLMs are trained to respond to prompts in particular ways. But there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run; it’s just that well-curated, high-quality raw data is already available.
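
        To make that two-phase distinction concrete, here’s a rough sketch of what the training data for each phase tends to look like. The file names and the JSONL format are just made up for illustration, not any lab’s actual pipeline:

        ```python
        import json

        # Phase 1: pretraining ingests raw, unstructured text dumps.
        # "web_scrape.txt" is a hypothetical stand-in for a scraped corpus.
        raw_corpus = open("web_scrape.txt", encoding="utf-8").read()
        print(f"pretraining corpus: roughly {len(raw_corpus.split())} words")

        # Phase 2: fine-tuning wants structured prompt/response pairs,
        # exactly the shape that synthetic generation can produce on demand
        # in whatever style you want the model to learn.
        synthetic_pairs = [
            {"prompt": "Summarize the article below: ...", "response": "..."},
        ]
        with open("finetune.jsonl", "w", encoding="utf-8") as f:
            for pair in synthetic_pairs:
                f.write(json.dumps(pair) + "\n")
        ```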

        This article on how to use LLMs to generate synthetic data seems to be pretty comprehensive, starting with the basics and then going into detail about how to generate it with a system called DeepEval. In another comment in this thread I pointed to NVIDIA’s Nemotron-4 models as another example.

        • leftzero@lemmynsfw.com · 39 minutes ago

          > there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run

          Ah, of course, it’s LLMs all the way down!

          No, but seriously, you’re aware they’re selling this shit as a replacement for search engines, are you not?

    • droopy4096@lemmy.ca · 14 hours ago

      I’m confused: why do we have a problem with AI bots crawling the internet and practically DoSing sites? Even if there’s a feed of synthesized data, it’s apparent that the content of internet sites plays a role too. So feeding AI slop back into AI sounds real to me.

      • BakedCatboy@lemmy.ml · 14 hours ago (edited)

        AIUI, back-feeding uncurated slop is a real problem, but curated slop is fine. So the choice is between curating slop and scraping websites, and scraping is almost free. Even though synthetic training data works, they still prefer to scrape websites because it’s easier, cheaper, or outright free.

      • FaceDeer@fedia.io · 13 hours ago

        Raw source data is often used to produce synthetic data. For example, if you’re training an AI to be a conversational chatbot, you might produce synthetic data by giving a different AI a Wikipedia article on some subject as context and then telling it to generate questions and answers about the content of the article. That Q&A output is then used for training.

        The resulting synthetic data does not contain any of the raw source, but it’s still based on that source. That’s one way to keep the AI’s knowledge well grounded.
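
        In code, that recipe is only a few lines. This is just a sketch of the idea as I described it above; query_llm() is a hypothetical stand-in for whatever model API you have, and the file names are made up:

        ```python
        import json

        def query_llm(prompt: str) -> str:
            """Placeholder: call your LLM of choice and return its text output."""
            raise NotImplementedError

        def make_qa_pairs(article_text: str, n: int = 5) -> list[dict]:
            # Ask a generator model to read the source and produce Q&A pairs.
            prompt = (
                f"Read the article below and write {n} question/answer pairs "
                'about its content, as a JSON list of objects with "question" '
                'and "answer" keys.\n\n' + article_text
            )
            return json.loads(query_llm(prompt))

        # The generated Q&A (after curation/filtering) becomes training data;
        # the raw article text itself never appears in the training set.
        article = open("wikipedia_article.txt", encoding="utf-8").read()
        with open("synthetic_qa.jsonl", "a", encoding="utf-8") as f:
            for pair in make_qa_pairs(article):
                f.write(json.dumps(pair) + "\n")
        ```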

        It’s a bit old at this point, but last year NVIDIA released Nemotron-4, a set of AI models designed specifically for performing this process. That page might help illustrate it in a bit more detail.