Comment by foresterre

2 days ago

I would really like to see what "appropriately licensed data" means. I cannot imagine they didn't copy all open repos on GitHub, and I can't imagine they asked for permission, or are now reproducing license texts from these repos. It sounds hand-wavy.

P.S. A fairly basic website otherwise, but it unfortunately seems to be hacking scroll for no good reason.

9 comments


ralph84 2 days ago

Presumably their position remains that training on public repos is fair use and doesn't require a license. If it doesn't require a license it's still "appropriately licensed".

stingraycharles 2 days ago

I assume they took the actual repos’ licenses into account. I don’t understand why they should ask for permission when the license would already allow for it.

foresterre 2 days ago
Almost all licenses impose requirements on redistributing copies of the work, or derivatives thereof. Even permissive licenses do. It's very little to ask when open source devs have provided thousands of hours of free work.
For example, the Apache 2.0 license requires, in section 4(c) alone:
You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works;
Just because the works are tokenized and transformed into a probabilistic mapping doesn't suddenly mean that they weren't copied.
I find it unethical that they (likely) just ingest the IP of all open source repos without asking, and, importantly, without any attribution.
Let me also note that I'm not against LLMs in general. But I do think training on open source must be opt-in, and I look forward to a world with actually ethical and traceable models (i.e. with a record of what they were trained on, like a bill of materials (BOM)).
- stingraycharles 1 day ago
  
  But that’s what I meant by taking it into account. They would likely only use BSD- and MIT-licensed repos, which is a lot.
rocqua 2 days ago
Which licenses allow usage for training? MIT, BSD, etc. likely do. But I would expect it gets weird for all the various copyleft licenses.
- cortesoft 2 days ago
  
  Why would it get weird for those?
  

VortexLain 2 days ago

GitHub recently changed its terms of service to use all user data for AI training unless users explicitly opt out. This is probably how Microsoft obtained "appropriately licensed data".

mattnewton 2 days ago

This is almost certainly too recent to have been used for training data, no? Unless they optimistically included most repos somehow?