Comment by rafram

6 hours ago

The core of the training data is public, but the part that actually makes these models smart came from (pretty highly paid) experts via platforms like Mercor. Claude didn't magically learn to write good code by reading all of GitHub - humans trained it to do that, more or less manually.



rapind 5 hours ago

If you pay me to curate a playlist of musical hits, can you now publish and charge people for access to that playlist (*including the curated material)? Can we do the same with movies? Books?

/edit Added a note to make it more obvious that the material is included in the playlist, just like the material is incorporated as part of curated AI models.

tanseydavid 4 hours ago
>> If you pay me to curate a playlist of musical hits, can you now publish and charge people for access to that playlist?
If the contract was "work-for-hire" then yes, of course I can.
- rapind 4 hours ago
  
  Maybe I wasn't clear. The playlist includes the material in it. Just like curated AI does.

datsci_est_2015 4 hours ago

Given the breadth of LLM knowledge, I somehow doubt this. Sure, it’s probably responsible for the quality of LLM insights, but I don’t think anyone was asking experts about e.g. the complex ecological effects of invasive zebra mussels and their provenance in Lake Michigan.

visarga 5 hours ago

No, they do RLVR (reinforcement learning with verifiable rewards) like everyone else. And they probably use Claude data too, with humans in the loop and tool feedback.

jaen 6 hours ago

...and the rest of the training data (i.e., the entire corpus of copyrighted works) was not written by experts expecting compensation? Double standards.

Ajedi32 5 hours ago
No, public data is not generally written by "experts expecting compensation".
By the way, I don't expect you to pay me for this comment. You can just read it for free. You're welcome.
- jaen 4 hours ago
  
  Ugh, please don't read strawmen into others' arguments, and please try to follow the HN guidelines.
  Also, how about making proper arguments yourself? The vast majority of the training data isn't generated by company-paid AI experts either.
  Notably, books, even though they don't form a large part of the training data, significantly improve performance on some tasks (same way as expert-generated data).
  Why do you think the AI labs are so eager about scanning (and then destroying) every book on the planet?
  If you removed all copyrighted works from the training corpus, the model would be notably weaker.
- calgoo 5 hours ago
  
  No, but people do upload data with an expectation that the data not be used without their permission (unless they use a BSD/MIT/public-domain-style license). Otherwise, the platform and/or the user expect the data NOT to be used for purposes other than what it was intended for. Your comment is still your comment, and the Hacker News platform also has a say in this. If there had been an opt-in, then fine, no problem, but there was none; they just trained on everything available, including downloading pirated books from the internet.
  
- pastel8739 5 hours ago
  
  Books?
  
rafram 6 hours ago
I didn't say that.
- thom 5 hours ago
  
  No, you just parroted an increasingly popular talking point, the entire purpose of which seems to be to absolve AI companies of the enormous theft that put them in the position to hire experts in the first place.
  
- jaen 5 hours ago
  
  Indeed, that's exactly why I replied - you omitted one side from the discussion.

freejazz 5 hours ago

So? What about the authors of all the works these companies stole?