Comment by bananapub
1 day ago
anyone who's actually worked there, could you explain why they're finding scalability and reliability so hard? naively it seems like 'repo groups', ie clusters of repositories linked by being mutual forks, would be fairly isolated for the whole git storage layer, and everything else feels pretty easily parallelisable (issues, actions, etc, modulo taking locks now and then to submit results or whatever). and given that, surely you can incrementally deploy changes across those many shards to avoid most big outages?
are there big conceptual serialisations that I've missed? is it just not well factored? was the move to Azure just a catastrophically bad idea? some other thing?
I can't speak to the last few years since I left, but over the many years I was there the git storage layer was almost never the core issue - it was well designed by infrastructure-minded nerds who leveraged and improved git and replicated it really well across multiple nodes.
What always struggled was the Rails monolith itself and its backing MySQL databases, strained by the richness of the product - the expectation that everything links to everything (think: issue cross-references across orgs that only appear if you're able to access the remote repo, and other things like that).
Those details appear richly everywhere you look. Combine that with a general lack of understanding of (and focus on) performance - shipping big features is hard; shipping them with performance at scale is MUCH harder - compounded by Ruby being an easy language to get performance wrong in (object count really hurts, and Ruby makes it very easy to create many objects), and every feature adds to the performance problem, making it daunting or impossible to make things fast once they're slow.
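To make the "object count really hurts" point concrete, here's a hypothetical sketch (none of this is GitHub's code): idiomatic chained Enumerable calls quietly materialise a full intermediate array at every stage, while a single-pass loop does the same work with far fewer allocations. `GC.stat(:total_allocated_objects)` lets you see the difference directly.

```ruby
# Count how many objects a block allocates, using Ruby's cumulative
# allocation counter (it never decreases, so the delta is reliable).
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Illustrative data shaped vaguely like ORM rows; names are made up.
rows = Array.new(10_000) { |i| { login: "user#{i}", score: i } }

# Each step in the chain builds a whole intermediate array (on top of
# the per-row strings), so allocations grow with every stage added.
chained = allocations do
  rows.map { |r| r[:login].upcase }.select { |name| name.end_with?("1") }
end

# A single pass into one result array skips the intermediates.
single = allocations do
  out = []
  rows.each do |r|
    name = r[:login].upcase
    out << name if name.end_with?("1")
  end
  out
end

puts "chained: #{chained} objects, single pass: #{single} objects"
```

Multiply that by every helper on every request and you get the "each feature adds to the problem" effect the comment describes; `Enumerable#lazy` or frozen string literals help, but only if someone is watching the numbers.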
There was a full-on year or more of making GitHub fast while I was there that just couldn't gain enough momentum to make enough of a dent. I remember finding and fixing an N^3 (or maybe it was N^4? something bad) in the home page activity feed - the worst thing I found, but it gives you an idea. IMO it would need a fresh view of how to keep interfaces simple and how to design the data layers performantly - not adding every bell and whistle to every screen.
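For readers wondering how an N^3 sneaks into a feed: it's usually layered "enrich each item" loops, where each layer re-scans a collection. A hypothetical sketch (every name here is invented, not GitHub's actual feed code) - the version below with two linear scans per event is already O(n^2), and one more nested enrichment pushes it to O(n^3), while indexing up front makes the whole render O(n):

```ruby
# Minimal stand-in for an activity feed event.
Event = Struct.new(:actor_id, :repo_id)

# Slow version: every event triggers linear scans over users, repos,
# and the event list itself - quadratic overall, and each additional
# nested enrichment multiplies in another factor of n.
def slow_feed(events, users, repos)
  events.map do |e|
    actor   = users.find { |u| u[:id] == e.actor_id }   # O(n) scan
    repo    = repos.find { |r| r[:id] == e.repo_id }    # O(n) scan
    related = events.count { |o| o.actor_id == e.actor_id } # O(n) scan
    [actor[:login], repo[:name], related]
  end
end

# Fast version: build hash indexes once, then every lookup is O(1).
def fast_feed(events, users, repos)
  users_by_id      = users.to_h { |u| [u[:id], u] }
  repos_by_id      = repos.to_h { |r| [r[:id], r] }
  related_by_actor = events.group_by(&:actor_id).transform_values(&:size)
  events.map do |e|
    [users_by_id[e.actor_id][:login],
     repos_by_id[e.repo_id][:name],
     related_by_actor[e.actor_id]]
  end
end
```

The same pattern shows up as N+1 queries when each "scan" is a database round trip, which is why these bugs feel fine in development and melt down at feed scale.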
I hope someone at GitHub realises they are about to lose everything that was hard-earned by early GitHub - it once was a site people (myself included) looked up to for exemplary availability, responsible releases, and data-driven improvement - but no more, it seems :(
Almost every high-volume service on the internet is write-a-little, read-a-lot, and when there are writes, they're relatively small: a few bytes into a database that can fan out. GitHub is very different - constant writes, large files - and it is under far more pressure than the systems the rest of us build. And then, as the article says, vibecoding happens, and suddenly they're receiving 30x the volume of expensive operations. GitHub is responsible for many of the performance improvements made to Git over the years - Git scales today because of work GitHub did - but that work was never intended to scale to the volume of today.
Even as recently as 18 months ago, Lovable appeared seemingly overnight and caused huge problems for GitHub, because they were creating a repository on GitHub for every single Lovable project - hundreds of thousands of repositories - offloading the very high cost onto GitHub. A couple of years before that, Homebrew used GitHub as a de facto CDN, and that was a huge problem too.
Nowadays it is easy to imagine how to scale out a service like Twitter or YouTube or Facebook, because everything has been done before. That's not true of Git: Git has never scaled like this before, and there are very few examples of services with GitHub's characteristics.
https://news.ycombinator.com/item?id=42659111
recently there was a tweet about how GitHub PR diffs had 10 React components PER LINE, and how they optimized that down to only 2 React components per line or something.
> To summarize, for every v1 diff line there would be:
> - Minimum of 10-15 DOM tree elements
> - Minimum of 8-13 React Components
> - Minimum of 20 React Event Handlers
> - Lots of small re-usable React Components
https://github.blog/engineering/architecture-optimization/th...
I'm asking about the infrastructure; obviously they chose, for some reason, to make my computer's fans turn on to show some red and green lines on a text file.
terrible frontend architecture suggests poor engineering culture, which typically spreads to all teams, including the infrastructure team