Comment by saaaaaam

4 days ago

The hidden gotcha in the Anthropic judgement (which I think is what you’re referencing?) is that feeding whole books into LLMs is considered legal fair use if you obtain them legitimately.

I suspect we need to wait for the NYT (and others) case to be decided before we know whether scraping sites in contravention of their terms is also fair use for LLM training.

My own opinion (as someone who creates written content on an occasional professional basis) is that if you can’t monetise your content in some other way than blocking people from accessing it then your content probably isn’t as valuable as you think.

But at the same time that’s tricky when it’s genuine journalism, as in NYT’s case.

Obviously user-generated content reviewing books online is rather different because the motivation of the reviewers was (presumably) not to generate money. And, indeed, with Goodreads there’s a strong argument that people have already been screwed over after their good faith review submissions were packaged up as an asset and flogged to Amazon. A lot of people were quite upset by that when it happened a decade or so back.

So from a ‘moral arguments’ perspective I don’t think scraping Goodreads is as problematic as other scraping examples.

(Sorry, none of this was aimed at you - your comment just got me thinking and it seemed as good a place as any to put it!)

Goodreads offers those reviews up publicly by serving them from its webservers to anyone who asks for them.

  • Sorry, I don’t understand the point you’re making. I know that these are publicly available. The point I was making, drawing on the parent comment, is that if using books to train LLMs has been deemed fair use when the content was legitimately obtained, then a similar assessment might apply to this sort of ingestion.

    If content is publicly available that does not necessarily mean it’s free of copyright control: the justification for using the reviews to train an LLM would be that fair use means it is not an infringement of copyright. But if the publisher has terms that forbid scraping, then the fair use argument may be undermined if it is predicated on the content being legitimately obtained. I’m not a lawyer but it’s quite easy to see how “books can be used for LLM training under fair use but not if you pirate them” extends to “content on the web can be used for LLM training under fair use but not if you’ve breached the terms set out by the publisher”.