Michael Rowe

Trying to get better at getting better

People tend to overestimate the value of their content in LLM training

Adi Robertson (2024-09-25). Mark Zuckerberg: Creators and Publishers ‘Overestimate the Value’ of Their Work for Training AI.

…if creators are concerned or object, “when push comes to shove, if they demanded that we don’t use their content, then we just wouldn’t use their content. It’s not like that’s going to change the outcome of this stuff that much.”

It sounds callous but it’s true; no single piece of information is very important for LLM training.

Companies building frontier models could completely remove any individual piece of content and it would have a negligible effect on the final model outputs. All the model training is doing is establishing the relationships between words. Unless the way you put words together is so novel that nothing like it appears anywhere else in the dataset, it’s unlikely that your content matters that much to the final model.

It does raise the following point though. If individual content creators need not be compensated for their contributions to model training – because all the value is in the collective – then the collective (i.e. society) should be compensated.

And in response I’d say that companies building frontier models are compensating society by giving us all free access to their most powerful frontier models. And I don’t know if you’ve noticed, but the usage limits on that free access keep increasing.

