
Comment by halJordan

1 day ago

I love how HN is loving this idea when it's the exact same thing Anthropic and OpenAI (and every other LLM maker) did.

It's God's gift to them when it lets them bypass ads and download copyrighted material. But it's Satan's curse on humanity when the Zuck does it to train his LLM and download copyrighted material.

Both scale and purpose make them completely different things. You're acting as if they're the same when they're not.

I won't comment on the downloading, but ads are trackers and spyware to me. I don't spy on website owners; I have the right to block those trackers.

Zuck serves ads/spyware to other users; he deserves to taste his own medicine, not me.

I think there's a little bit of the Goomba fallacy at play here to be fair

Yes, it's God's gift when the average user can do it, and Satan's curse when a hated fucking mega-corp does it.

Where's the contradiction?

You can see this pattern in many different topics: updoots are highly correlated with a positive answer to "do I personally get to profit?"

  • Yes, and? People need to eat. Billionaires are generally not interested in whether or not the average Joe gets to eat.

I would love to pay for content. I'm _paying_ for YouTube Premium.

But heck, do I hate the YouTube interface; it has degraded far past usability.

So you’re that Hal Jordan then? Why would a Green Lantern feel the need to defend either? I feel like the Guardians would not accept your arguments as soon as you got to Oa, poozer. I guess what I am saying is don’t have a famous name. Seems obvious.

  • OP appears to be talking about real life. What are you on about?

    • the user name he is responding to is HalJordan, Hal Jordan is the name of a comic book superhero: Green Lantern, a moral paragon.

      on edit: he is evidently being "sarcastic"

You conflate web crawling for inference with web crawling for training.

Web crawling for training is when you ingest content on a mass scale, usually indiscriminately, usually with a dumb crawler for scale's sake, for the purposes of training an LLM. You don't really care whether one particular website is in the dataset (unless it's the size of Reddit), you just want a large, diverse, high-quality data mix.

Web crawling for inference is when a user asks a targeted question, you do a web search, and fetch exactly those resources that are likely to be relevant to that search. Nothing ends up in the training data; it's just context enrichment.
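The inference-time flow described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual pipeline: `search`, `fetch_page`, and `build_context` are hypothetical stand-ins for a real search API and HTTP client, backed here by in-memory dicts so the example is self-contained.

```python
# Hypothetical sketch of "crawling for inference": a targeted search,
# a targeted fetch, and transient context enrichment. Nothing below is
# stored or added to any training set.

# Stand-in for a search API's index (assumption, not a real service).
SEARCH_INDEX = {
    "python gil removal": ["https://example.org/pep-703"],
}

# Stand-in for the web (assumption: one page, fetched on demand).
PAGES = {
    "https://example.org/pep-703": "PEP 703 proposes making the GIL optional in CPython.",
}

def search(query: str) -> list[str]:
    """Return URLs likely relevant to the query (stand-in for a search API)."""
    return SEARCH_INDEX.get(query.lower(), [])

def fetch_page(url: str) -> str:
    """Fetch exactly one targeted resource (stand-in for an HTTP GET)."""
    return PAGES[url]

def build_context(question: str) -> str:
    """Fetch only the resources relevant to this one question and join
    them into a context block for the model. The result is discarded
    after the answer is produced -- that is the whole distinction from
    a mass training crawl, which ingests indiscriminately and keeps it."""
    urls = search(question)
    snippets = [fetch_page(u) for u in urls]
    return "\n\n".join(snippets)

print(build_context("Python GIL removal"))
```

The contrast with a training crawl is in what's missing: there is no frontier queue of every discoverable URL and no persistent dataset, only the handful of pages one question made relevant.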

People have a much larger issue with crawling for training than with crawling for inference (though I personally think both are equally fine).