
Comment by bayindirh

17 hours ago

Note: I'm a graybeard coming from the SVN era.

GitHub took a massive hit in credibility when it got bought by Microsoft. We are a burned generation, we have seen the worst of Microsoft. This created a massive crack in the foundation of trust for most people.

Then Copilot happened. Some people dug into how the training was done, and one GitHub employee responded by mail that every public repository, including GPL repositories, was included (the relevant Tweets are deleted, unfortunately). The crack deepened. Some of us (incl. me) left GitHub.

As Copilot became entrenched, as Microsoft's product development practices and philosophy took over, and as hordes of developers started vibe coding, GitHub's code foundations started to crumble. Add the big migrations they're doing and the UI regressions they're causing now, and here we are.

GitHub's first enshittification cycle is over. Now we're starting the second cycle: the phase where the bloated, slow, entrenched hegemon decays from relevance.

It'll be a slow decay. It won't fall in a day, but the golden era is long gone.

Any more context on the Copilot training note? More pointers would be very interesting, but we'd need to keep in mind how many different underlying models were (are?) branded as Copilot. I thought that at some point the "Copilot" model in autocomplete contexts was a finetuned GPT from OAI.

Re: GPL, there are other open access datasets of git repos that make some distinctions between copyleft licenses but those are older resources now.

  • Please see below. This is from the OG, "first generation" Copilot, from 2022. If I can find any more from my dusty trove, I'll edit or reply to this very comment. I can't do more digging now, because I'm in a pinch.

    > Re: GPL, there are other open access datasets of git repos that make some distinctions between copyleft licenses but those are older resources now.

    Supposedly, "The Stack" contains only permissively licensed code, but there are two repositories of mine inside it. One is a very simple logging library without any license (which implies "All Rights Reserved"), and the other is a fork of LightDM that I worked on, which is GPL licensed.

    So any "permissively licensed" dataset probably contains at least one copylefted or strongly copyrighted codebase, which makes such datasets highly suspect.

    == EDIT ==

    Found some. Kagi's date-constrained search to the rescue.

    1. Should GitHub be sued for training Copilot on GPL code?: https://web.archive.org/web/20260428180443/https://github.co...