Comment by fc417fc802

1 hour ago

> attempting to train on AI generated output

I said nothing about that. Good synthetic data does not (typically) involve ML algorithms. Although that might be changing.

I'll politely suggest that you go read the literature before engaging further.

Reddit, Twitter, and similar are valuable because the data covers current events. Their content makes up a reasonably comprehensive timeline of the world at large. You don't need that to train a barebones functional model but it's certainly useful in order to train a knowledgeable one. Regardless, if they're charging for access it clearly isn't piracy so it doesn't seem like your original objection would hold any water in that case.

> I'll politely suggest that you go read the literature before engaging further.

Which commercial AI vendor has not stolen any content when creating their models? I’ll wait.

Which commercial AI vendor has created their models by training exclusively on datasets created by other AI?

> Regardless, if they're charging for access it clearly isn't piracy so it doesn't seem like your original objection would hold any water in that case.

Given that they were previously violating the site’s terms of service when scraping the content: yes, they were absolutely stealing.