Adi Robertson (2024-09-25). Mark Zuckerberg: Creators and Publishers ‘Overestimate the Value’ of Their Work for Training AI.
…if creators are concerned or object, “when push comes to shove, if they demanded that we don’t use their content, then we just wouldn’t use their content. It’s not like that’s going to change the outcome of this stuff that much.”
It sounds callous but it’s true; no single piece of information is very important for LLM training.
Companies building frontier models could completely remove any individual piece of content and it would have a negligible effect on the final model outputs. Loosely speaking, all the training is doing is learning the statistical relationships between words. Unless the way you put words together is so novel that it appears nowhere else in the dataset, it’s unlikely that your content matters that much to the final model.
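To make that intuition concrete, here is a toy leave-one-out sketch. It is an assumption-laden illustration, not how any frontier model is trained: it swaps the LLM for a simple bigram counter, and the corpus, `bigram_probs`, and `max_shift` are all hypothetical names invented for this example. It measures how far any next-word probability moves when a single document is removed.

```python
from collections import Counter

def bigram_probs(docs):
    """Estimate P(next word | previous word) from whitespace-tokenised documents."""
    pair_counts = Counter()
    prev_counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}

def max_shift(docs, index):
    """Largest change in any bigram probability when docs[index] is removed."""
    full = bigram_probs(docs)
    reduced = bigram_probs(docs[:index] + docs[index + 1:])
    return max(abs(p - reduced.get(pair, 0.0)) for pair, p in full.items())

# Hypothetical corpus: 1,000 copies of a commonplace sentence plus one novel one.
corpus = ["the cat sat on the mat"] * 1000 + ["zorp glimmers quietly"]

print(max_shift(corpus, 0))                # remove one commonplace document
print(max_shift(corpus, len(corpus) - 1))  # remove the one novel document
```

In this toy setup, dropping one of the thousand commonplace documents leaves every probability unchanged, while dropping the only document containing "zorp glimmers" erases that relationship entirely. A real LLM is vastly more complex, so treat this only as a picture of the argument, not as evidence for it.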
It does raise the following point though. If individual content creators need not be compensated for their contributions to model training – because all the value is in the collective – then the collective (i.e. society) should be compensated.
And in response I’d say that companies building frontier models are compensating society, by giving us all free access to their most powerful frontier models. And I don’t know if you’ve noticed but the usage limits on that free access keep going up.