Comment by YetAnotherNick
2 years ago
A full scrape of public GitHub is available to anyone. GPT-4 was trained before the Microsoft deal, so I don't think its lead comes from GitHub access. And GPT-4 is significantly better than the second-best model in every field, not just coding.
Is this practically true? Yes, anyone can clone any repo from GitHub, but surely scraping all of GitHub would run into rate limits?
The terms and conditions say as much: https://docs.github.com/en/site-policy/github-terms/github-t...
Well, today you get to learn about the GitHub Archive project, which creates dumps of all GitHub data.
One example is the data hosted in Google Cloud.
https://cloud.google.com/blog/topics/public-datasets/github-...
And there is no evidence that GitHub is violating any open-source licenses.
So they are going to be training on exactly the same data that is available to all.