Comment by ghaff
2 years ago
I doubt Microsoft sees fragments of Windows source code as a particular crown jewel these days. That said, some of it is decades-old code that was never intended for the public to see (unlike, presumably, anything in a public GitHub repository). And some of it is presumably third-party code licensed to Microsoft that was likewise never intended for public viewing. So, while it would be a good gesture on the part of Microsoft to scan their own code--if they haven't done so--I could see why it might be problematic. (Just as training on private GitHub repos would be.)
tl;dr I think there's a distinction between training on copyrighted but public content and private content.
Private third-party GitHub repos are another good example. If licenses don't apply to training data, as GitHub has asserted, why not use those too? Do they think they'll get in trouble over it? Why doesn't the same trouble apply to my publicly-readable GPL-licensed code?
I assume there's something in their terms of service about not poking around in private repos and using the code even for internal purposes except for necessary maintenance like backups, court orders, etc.
I am not a lawyer, but I also assume Microsoft's position, at least in part, is that they can download and use code in GitHub public repos just like anyone else can, and that developing a public service based on training with that (and a lot of other) code isn't redistributing that code.
Copyright is not the only law. Something might be permitted by copyright law (as fair use, an implied license, etc.) yet simultaneously violate other laws: breach of contract, misappropriation of trade secrets, etc.