Masse, B. (2024, November 8). OpenAI’s data scraping wins big as Raw Story’s copyright lawsuit dismissed by NY court. VentureBeat.
The judge noted that “the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote.” This reflects a key difficulty in these types of cases: generative AI is designed to synthesize information rather than replicate it verbatim. The plaintiffs failed to present convincing evidence that their specific works were directly infringed in a way that led to identifiable harm.
This is an important ruling that should ease many of the concerns around copyright infringement when using LLMs, though as a single district court decision it is unlikely to be the last word.
In addition, it’s another data point in support of the idea that we should think about language models in a similar way to how we think about human memory. We see, hear, and read information, but our memory of that information isn’t photographic; recall is reconstructive rather than the functional equivalent of retrieving records from a database.
Another point from the ruling that’s worth reflecting on is the scale of the training dataset, and how unlikely it is that any individual creator’s work meaningfully shaped the model. This bears on the idea of ‘harm’ for creators who feel they deserve compensation because their works were (possibly) included in the dataset. The reality is that any single collection of works in the training data is unlikely to have had a significant impact on how the model behaves.
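To make the scale point concrete, here is a quick back-of-the-envelope calculation. The figures are illustrative assumptions (a typical article length and a rough corpus size for a recent frontier model), not numbers from the ruling or from OpenAI:

```python
# Rough estimate: what fraction of an LLM training corpus does one
# article represent? Both figures below are assumptions for illustration.
article_tokens = 1_500                 # assumed length of a news article
corpus_tokens = 10_000_000_000_000     # assumed ~10 trillion tokens in the corpus

fraction = article_tokens / corpus_tokens
print(f"One article is roughly {fraction:.1e} of the corpus")
```

Under these assumptions a single article accounts for on the order of one ten-billionth of the training data, which gives a sense of why tracing any specific model behavior back to one work is so difficult.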