• FaceDeer@fedia.io
    17 hours ago

    Raw source data is often used to produce synthetic data. For example, if you’re training an AI to be a conversational chatbot, you might produce synthetic data by giving a different AI a Wikipedia article on some subject as context and then telling it to generate questions and answers about the content of the article. That Q&A output is then used for training.

    The resulting synthetic data does not contain any of the raw source, but it’s still based on that source. That’s one way to keep the AI’s knowledge well grounded.
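To make that a bit more concrete, here's a rough Python sketch of the loop. It's only an illustration, not how any particular lab does it: `build_prompt`, `parse_pairs`, `make_synthetic_qa` and the "Q:"/"A:" output format are all names and conventions I made up, and `fake_generator` is a canned stand-in for whatever model you'd actually call.

```python
from typing import Callable


def build_prompt(article: str, n_pairs: int) -> str:
    """Wrap the raw source text in an instruction for the generator model."""
    return (
        f"Read the article below and write {n_pairs} question/answer pairs "
        "about its content. Put each question on a line starting with 'Q:' "
        "and each answer on a line starting with 'A:'.\n\n"
        f"ARTICLE:\n{article}"
    )


def parse_pairs(raw: str) -> list[dict[str, str]]:
    """Turn 'Q: ... / A: ...' lines into a list of training records."""
    pairs = []
    question = None
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question:
            pairs.append({"question": question, "answer": line[2:].strip()})
            question = None
    return pairs


def make_synthetic_qa(
    article: str,
    generate: Callable[[str], str],
    n_pairs: int = 3,
) -> list[dict[str, str]]:
    """Ask the generator for Q&A pairs grounded in the article, then parse them."""
    return parse_pairs(generate(build_prompt(article, n_pairs)))


def fake_generator(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return (
        "Q: What is the capital of France?\n"
        "A: Paris.\n"
        "Q: Which country is Paris the capital of?\n"
        "A: France.\n"
    )


if __name__ == "__main__":
    article = "Paris is the capital of France."
    for pair in make_synthetic_qa(article, fake_generator, n_pairs=2):
        print(pair)
```

The list of question/answer records is what goes into the training set. The article itself only ever appears in the prompt sent to the generator.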

    It’s a bit old at this point, but last year NVIDIA released Nemotron-4, a set of AI models specifically designed for performing this process. That page might help illustrate the process in a bit more detail.