Comment by voidUpdate

4 days ago

Does this break part 4 of the Goodreads TOS?

"[...] you agree not to sell, license, rent, modify, distribute, copy, reproduce, transmit, publicly display, publicly perform, publish, adapt, edit or create derivative works from any materials or content accessible on the Service. Use of the Goodreads Content or materials on the Service for any purpose not expressly permitted by this Agreement is strictly prohibited."

Also, did the reviewers give you permission to feed their content into an LLM?

Fairly meaningless in this day and age. Also IIRC scraping legality depends heavily on jurisdiction. Some places take a more permissive view of accessing publicly available information, even if a site's TOS forbids bots.

In the US there’s a major precedent [0] which held that scraping public-facing pages isn’t a CFAA "unauthorized access" issue. That’s a big part of why we’ve seen entire venture-backed scraping companies pop up - it’s not considered hacking if the data is already public.

[0] https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

  • From that article:

    > However, after further appeal in another court, hiQ was found to be in breach of LinkedIn's terms, and there was a settlement.

    So why would the same not apply here?

    • They settled out of court, that doesn't mean that they were found to be in breach of the terms.

      These were some of the notable elements (worth noting that none mention breaching terms of service):

      > Damages: Judgment in the amount of $500,000 is entered against hiQ, with all other monetary relief waived.

      > CFAA liability: hiQ stipulates that LinkedIn experienced losses sufficient to, and “may establish liability” under a CFAA civil claim “based on hiQ’s data collection practices and based on hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts.”

      > California “CFAA”: hiQ stipulates that LinkedIn “may establish civil liability” under California’s state-law counterpart to the CFAA based on hiQ’s data collection practices, use of fake accounts and other means to evade detection by LinkedIn, hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts, and hiQ’s unauthorized commercial use of data.

      > Trespass: hiQ stipulates that LinkedIn has established judgment as to liability under California law for the common law torts of trespass to chattels and misappropriation.

      > Irreparable harm: hiQ stipulates that LinkedIn has established that it has suffered an irreparable injury and that LinkedIn satisfied the remaining factors and is entitled to a permanent injunction.

      https://natlawreview.com/article/hiq-and-linkedin-reach-prop...

    • A settlement means there was no legal ruling and no precedent set. The entire case is legally moot.

      In America, you can simply pay to not lose any lawsuit ever, and thus never have to face legal consequence or changes to the law you don't like.

  • > CFAA liability: hiQ stipulates that LinkedIn experienced losses sufficient to, and “may establish liability” under a CFAA civil claim “based on hiQ’s data collection practices and based on hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts.”

    This was part of the terms of the settlement.

  • So if you are legally allowed to "adapt, edit or create derivative works from any materials", what's the point of the TOS?

    • The TOS specify the circumstances in which the corp may take action that is unrelated to the legal system. Just because they can't sue you (and easily win) for scraping, doesn't mean they can't block you if they notice you doing it.

      Google for example has a TOS and is well known for permanently banning accounts for real or imagined or AI-generated violations of it. Google banning you for breaking TOS doesn't mean you broke the law, just that you broke their rules, which apparently include a clause against being in the wrong place at the wrong time.

    • I believe TOS is binding as long as it doesn't conflict with the law. If something is deemed fair use under the law, TOS cannot override those legal rights.

    • That’s a good question. It also would not be the first time that companies use trickery and manipulation or even deliberately illegal practices for various business/financial reasons. At the very least it could be used as a tool to underpin intimidating lawsuits. A step further, regardless of the legality in the relevant jurisdiction, it could be used to influence official government foreign policy to exert pressure on a jurisdiction that permits scraping.

  • Tell that to judyrecords with the same smug attitude.

    Your textbook versus reality conceptualization of things is dogshit. It’s exploitation to do what OP did. You’re endorsing it and minimizing the ethics and this certainly shall poison the well from which you drink. Godspeed.

    • This is so overly dramatic it’s hard to even consider the point you’re trying to make.

    • You ok bud? You sound unhinged here. Your post doesn't even make sense in context of the one you were replying to.

What expectation of confidentiality are you ascribing to people having posted publicly accessible opinions on the internet?

Out of curiosity, is your point about TOS out of concern for the poster or for Goodreads?

  • My expectation isn't of confidentiality, but of attribution. Sure, my website is perfectly accessible on the internet, and I'm fine with being able to find it on Google, but if you pipe it into an algorithm that will start throwing out stuff based on what I wrote, with zero reference to me at all, I'd get a bit annoyed. This website has taken the combined output of probably thousands of people, shoved it into an algorithm and is then using their work to give "original" ideas. If one person wanted their content removed from the system, how would you do that?

  • What does that comment have to do with confidentiality?

    • That he viewed a review on Goodreads as the reviewer’s intellectual property hadn’t occurred to me. I see why, in aggregate, many such opinions become valuable, but the whole is more than the sum of its parts.

      So does it feel to you guys like your comments, say, here in this Hacker News thread should be considered effectively copyrighted as your personal IP?

      If so, do you feel the same way about opinions you share out in a supermarket or on the street?

Technically speaking, none of Goodreads' material or content is being used publicly; the only information displayed on the site is freely available (Title, Author) and not Goodreads' property.

You could try to argue that this falls under "create derivative works from any materials or content accessible on the Service" but even then it seems really flimsy to say that recommending books based on Goodreads reviews is an infringement.

It's just not that different to a youtuber saying "I read reviews for 50 books, here's the ones to read"

  • I visit your garden and take 1 apple from your tree

    I visit your garden and take 1000 apples from your tree.

    Not that different.

    • Not only am I taking 1,000 apples, but I use those 1,000 apples to start my own orchard and encourage people to come to it instead of yours.

    • Not a great analogy, since a digital copy leaves the original intact, unlike your apples.

    • For every apple I take, you still have your apple on the tree, because my apple is only a copy of yours.

At what point are they feeding reviews into an LLM? From what I got the only personal data they're using is which user read which books.

This is, essentially, why I've withdrawn from posting content from my human brain almost anywhere on the open internet (except here, sometimes) and have retired blog posts, opinions, and so on to our friends' WAN.

I’m not taking sides in this debate. However, since feeding whole books into LLMs is considered legal fair use now, I guess these reviews don’t require permission either. Would be great to hear a professional lawyer's take on this.

  • The hidden gotcha in the Anthropic judgement (which I think is what you’re referencing?) is that feeding whole books into LLMs is considered legal fair use if you obtain them legitimately.

    I suspect we need to wait for the NYT (and others) case to be decided before we know whether scraping sites in contravention of their terms is also fair use for LLM training.

    My own opinion (as someone who creates written content on an occasional professional basis) is that if you can’t monetise your content in some other way than blocking people from accessing it then your content probably isn’t as valuable as you think.

    But at the same time that’s tricky when it’s genuine journalism, as in NYT’s case.

    Obviously user-generated content reviewing books online is rather different because the motivation of the reviewers was (presumably) not to generate money. And, indeed, with Goodreads there’s a strong argument that people have already been screwed over after their good faith review submissions were packaged up as an asset and flogged to Amazon. A lot of people were quite upset by that when it happened a decade or so back.

    So from a ‘moral arguments’ perspective I don’t think scraping Goodreads is as problematic as other scraping examples.

    (Sorry, none of this was aimed at you - your comment just got me thinking and it seemed as good a place as any to put it!)

If it's on the internet, and people can access it, then it's public. I would have no expectations for what people do with public data; that just seems like setting yourself up for disappointment.

  • Is a pirated movie, found on bittorrent, public?

    IMO, your definition is overbroad

    • If it's on bittorrent then, yes, it's public. It doesn't matter if you intended it to be or not, it's publicly accessible, therefore it's public.