Comment by swatcoder

2 years ago

Because they're ultimately training data simulators and not actually brilliant artificial programmers, we can expect Microsoft-affiliated models like ChatGPT4 and beyond to have much stronger value for coding because they have unmediated access to GitHub content.

So it's most useful to look at other capabilities and opportunities when evaluating LLMs with a different heritage.

Not to say we shouldn't evaluate this one for coding or report our evaluations, but we shouldn't be surprised that it's not leading the pack on that particular use case.

11 comments

YetAnotherNick 2 years ago

A full scrape of public GitHub is available to anyone. GPT-4 was trained before the Microsoft deal, so I don't think it is because of GitHub access. And GPT-4 is significantly better at everything compared to the second-best model in any given field, not just coding.

avita1 2 years ago
Is this practically true? Yes, anyone can clone any repo from Github, but surely scraping all of Github would run into rate limits?
The terms and conditions say as much https://docs.github.com/en/site-policy/github-terms/github-t...
- vineyardmike 2 years ago
  
  Well today you get to learn about the GitHub Archive project, which creates dumps of all GitHub data.
  One example is the data hosted in Google Cloud.
  https://cloud.google.com/blog/topics/public-datasets/github-...
threeseed 2 years ago

And there is no evidence that Github is violating any open source licenses.
So they are going to be training on exactly the same data that is available to all.

whimsicalism 2 years ago

idk we're just "have more kids" simulators and we do pretty good at programming as a side-task

swatcoder 2 years ago

Sure, and those of us who have more robust preparation and exposure generally do a better job of it.
preommr 2 years ago
Someone doesn't get good at programming with low quality learning sources. Also, a poor comparison because models are not people - might as well complain about how NPCs in games behave because they fail at problems real people can solve.
- whimsicalism 2 years ago
  
  We are both substrate that has been aggressively optimized for a task with a lot of side benefits. "NPC"s are not optimized at all, they are coded using symbolic rules/deterministic behavior.

ironrabbit 2 years ago

Zero chance private github repos make it into openai training data, can you imagine the shitshow if GPT-4 started regurgitating your org's internal codebase?

nomel 2 years ago

Org-specific AI is, almost certainly, the killer app. This will have to be possible at some point, or OpenAI will be left in the dust.
whimsicalism 2 years ago

You are downvoted but I agree.