• FaceDeer@fedia.io · 17 hours ago

    Betteridge’s law of headlines.

    Modern LLMs are trained using synthetic data, which is explicitly AI-generated. It’s done that way so the data’s format and content can be tailored to maximize its value in the training process. Over the past few years it’s become clear that simply dumping raw data from the Internet into LLM training isn’t a very good approach. It sufficed to bootstrap AI development, but we’re kind of past that point now.

    Even if there were a problem with training new AIs, that would just mean they won’t get better until the problem is overcome. It doesn’t mean they’ll perform “increasingly poorly,” because the old models still exist; you can just keep using those.

    But lots of people really don’t like AI and want to hear headlines saying it’s going to get worse or even go away, so this bait will get plenty of clicks and upvotes. Though I’ll give the body of the article credit: if you read more than halfway down, you’ll see it raises these sorts of issues itself.

    • NotSteve_@lemmy.ca · 7 hours ago

      Are there any articles about this? I believe you, but I’d like to read more about the synthetic training data.

      • FaceDeer@fedia.io · 6 hours ago

        Thanks for asking. My comment was off the top of my head, based on stuff I’ve read over the years, so first I did a little fact-checking of myself to make sure. There’s still a lot of black magic involved in training LLMs, so the exact mix of training data varies a lot depending on who you ask. In some cases raw data is still used for the initial pretraining of LLMs, to get them to the point where they’re capable of responding coherently to prompts, while synthetic data is more often used for the fine-tuning phase, where LLMs are trained to respond to prompts in particular ways. But there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run; it’s just that well-curated, high-quality raw data is already available.
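
        To make that two-phase distinction concrete, here’s a rough sketch of what the training data for each phase tends to look like. The file names and the JSONL format are just made up for illustration, not any lab’s actual pipeline:

        ```python
        import json

        # Phase 1: pretraining ingests raw, unstructured text dumps.
        # "web_scrape.txt" is a hypothetical stand-in for a scraped corpus.
        raw_corpus = open("web_scrape.txt", encoding="utf-8").read()
        print(f"pretraining corpus: roughly {len(raw_corpus.split())} words")

        # Phase 2: fine-tuning wants structured prompt/response pairs,
        # exactly the shape that synthetic generation can produce on demand
        # in whatever style you want the model to learn.
        synthetic_pairs = [
            {"prompt": "Summarize the article below: ...", "response": "..."},
        ]
        with open("finetune.jsonl", "w", encoding="utf-8") as f:
            for pair in synthetic_pairs:
                f.write(json.dumps(pair) + "\n")
        ```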

        This article on how to use LLMs to generate synthetic data seems to be pretty comprehensive, starting with the basics and then going into detail about how to generate it with a system called DeepEval. In another comment in this thread I pointed to NVIDIA’s Nemotron-4 models as another example.

        • leftzero@lemmynsfw.com · 39 minutes ago

          > there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run

          Ah, of course, it’s LLMs all the way down!

          No, but seriously, you’re aware they’re selling this shit as a replacement for search engines, are you not?

    • droopy4096@lemmy.ca · 14 hours ago

      I’m confused: why do we have a problem with AI bots crawling the internet and practically DoSing sites? Even if there’s a feed of synthesized data, it’s apparent that the content of internet sites plays a role too. So feeding AI slop back into AI sounds real to me.

      • BakedCatboy@lemmy.ml · 14 hours ago (edited)

        AIUI, back-feeding uncurated slop is a real problem, but curated slop is fine. So the choice is between curating slop and scraping websites, and scraping is almost free. Even though synthetic training data works, they still prefer to scrape websites because it’s easier, cheaper, or outright free.

      • FaceDeer@fedia.io · 13 hours ago

        Raw source data is often used to produce synthetic data. For example, if you’re training an AI to be a conversational chatbot, you might produce synthetic data by giving a different AI a Wikipedia article on some subject as context and then telling it to generate questions and answers about the content of the article. That Q&A output is then used for training.

        The resulting synthetic data does not contain any of the raw source, but it’s still based on that source. That’s one way to keep the AI’s knowledge well grounded.
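
        In code, that recipe is only a few lines. This is just a sketch of the idea as I described it above; query_llm() is a hypothetical stand-in for whatever model API you have, and the file names are made up:

        ```python
        import json

        def query_llm(prompt: str) -> str:
            """Placeholder: call your LLM of choice and return its text output."""
            raise NotImplementedError

        def make_qa_pairs(article_text: str, n: int = 5) -> list[dict]:
            # Ask a generator model to read the source and produce Q&A pairs.
            prompt = (
                f"Read the article below and write {n} question/answer pairs "
                'about its content, as a JSON list of objects with "question" '
                'and "answer" keys.\n\n' + article_text
            )
            return json.loads(query_llm(prompt))

        # The generated Q&A (after curation/filtering) becomes training data;
        # the raw article text itself never appears in the training set.
        article = open("wikipedia_article.txt", encoding="utf-8").read()
        with open("synthetic_qa.jsonl", "a", encoding="utf-8") as f:
            for pair in make_qa_pairs(article):
                f.write(json.dumps(pair) + "\n")
        ```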

        It’s a bit old at this point, but last year NVIDIA released Nemotron-4, a set of AI models designed specifically for performing this process. That page might help illustrate it in a bit more detail.