Comment by swatcoder
2 years ago
Because they're ultimately training data simulators and not actually brilliant artificial programmers, we can expect Microsoft-affiliated models like ChatGPT4 and beyond to have much stronger value for coding because they have unmediated access to GitHub content.
So it's most useful to look at other capabilities and opportunities when evaluating LLMs with a different heritage.
Not to say we shouldn't evaluate this one for coding or report our evaluations, but we shouldn't be surprised that it's not leading the pack on that particular use case.
A full scrape of public GitHub is available to anyone. GPT-4 was trained before the Microsoft deal, so I don't think it is because of GitHub access. And GPT-4 is significantly better at everything compared to the second-best model in each field, not just coding.
Is this practically true? Yes, anyone can clone any repo from GitHub, but surely scraping all of GitHub would run into rate limits?
The terms and conditions say as much https://docs.github.com/en/site-policy/github-terms/github-t...
Well today you get to learn about the GitHub Archive project, which creates dumps of all GitHub data.
One example is the data hosted in Google Cloud.
https://cloud.google.com/blog/topics/public-datasets/github-...
And there is no evidence that GitHub is violating any open source licenses.
So they are going to be training on exactly the same data that is available to all.
idk we're just "have more kids" simulators and we do pretty good at programming as a side-task
Sure, and those of us who have more robust preparation and exposure generally do a better job of it.
Nobody gets good at programming from low-quality learning sources. Also, it's a poor comparison because models are not people - you might as well complain about how NPCs in games behave because they fail at problems real people can solve.
We are both substrates that have been aggressively optimized for a task with a lot of side benefits. NPCs are not optimized at all; they are coded using symbolic rules and deterministic behavior.
Zero chance private GitHub repos make it into OpenAI training data. Can you imagine the shitshow if GPT-4 started regurgitating your org's internal codebase?
Org specific AI is, almost certainly, the killer app. This will have to be possible at some point, or OpenAI will be left in the dust.
You are downvoted but I agree.