Comment by skissane

2 years ago

> Have they actually trained Copilot on their own source? If not, why not?

People have posted illegal Windows source code leaks to GitHub. Microsoft doesn’t seem to care that much, because these repos stay up for months or even years at a time without Microsoft DMCAing them. If you go looking, you’ll find some right now. I think it is entirely possible, even likely, that some of those repos were included in Copilot’s training data set. If so, Copilot actually was trained on (some of) Microsoft’s proprietary source code, and Microsoft doesn’t seem to care.

The question is not whether there's some of their code that they don't mind being incorporated, but whether there's any at all that they wouldn't allow to be incorporated. And more importantly, whether they would allow it to be used not for their own bot, but for someone else's.

If licenses don't apply to training, then they don't apply for anyone, anywhere. If they do apply, then Copilot is violating my license.

  • IANAL, but they likely believe their unpublished source code contains trade secrets. They may believe that training a public model on published source code is okay (irrespective of its copyright license), but that doing so on unpublished source code containing trade secrets might legally count as a voluntary relinquishment of their trade secrets (if we are talking about their own code) or illegal misappropriation of the trade secrets of others (if they trained it on third-party private repos).