Comment by loloquwowndueo
1 day ago
Just a reminder that GitHub is not git.
The article mentions that most of these projects used GitHub as a central repo out of convenience, so there's that, but they could also have used self-hosted repos.
Explain to me how you self-host a git repo which is accessed millions of times a day by CI jobs pulling packages.
I'm not sure whether this question was asked in good faith, but it is actually a damn good one.
I've looked into self-hosting a git repo with horizontal scalability, and it is indeed very difficult. I don't have the time to detail it in a comment here, but for anyone who is curious, it's very informative to look at how GitLab handled this with Gitaly. I've also seen some clever attempts to use object storage, though I haven't seen any of those solutions put heavily to the test.
I'd love to hear from others about ideas and approaches they've heard about or tried
https://gitlab.com/gitlab-org/gitaly
Let's assume 3 million requests a day. That's about 35 per second.
From a compute POV you can serve that with one server or virtual machine.
Bandwidth-wise, given a 100 MB repo size, that would make it about 3.4 GB/s - also easy terrain for a single server.
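A quick back-of-the-envelope sketch of that arithmetic (the 3 million pulls per day and 100 MB per pull are the assumptions from the comment above, and assume every pull is a full clone):

```python
# Back-of-the-envelope load estimate; both inputs are assumptions from the
# comment above (3M pulls/day, ~100 MB transferred per pull, i.e. full clones).
requests_per_day = 3_000_000
bytes_per_request = 100 * 1024**2

seconds_per_day = 24 * 60 * 60
requests_per_second = requests_per_day / seconds_per_day      # ~34.7
bandwidth = requests_per_second * bytes_per_request            # bytes/s

print(f"{requests_per_second:.1f} requests/s")
print(f"{bandwidth / 1024**3:.1f} GiB/s sustained")            # ~3.4
print(f"{bandwidth * 8 / 1e9:.0f} Gbit/s of egress, roughly")  # ~29
```

In practice shallow clones and proxy/CDN caching would cut the bandwidth figure dramatically; the per-request compute cost is the harder part, as the next reply points out.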
That is roughly the number of new requests per second, but these are not just light web requests.
The git transport protocol is "smart" in a way that is arguably rather dumb, and it's certainly expensive on the server side. All of its smartness is aimed at reducing the amount of data transferred and the number of connections, but to do that it shifts a considerable amount of work onto the server, which has to figure out exactly which objects to send you.
If you benchmark the resource loads of this, you probably won't be saying a single server is such an easy win :)
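To make that server-side cost concrete, here is a minimal, deliberately naive sketch of a read-only git smart-HTTP endpoint (the repo path and port are hypothetical, and details like gzip-encoded request bodies and protocol v2 are ignored). The point is that every fetch forks git upload-pack, which negotiates wants/haves and builds a compressed pack per request instead of serving a static file:

```python
# Minimal, naive git smart-HTTP (read-only) endpoint; paths/port hypothetical.
# Every fetch forks `git upload-pack`, which selects and compresses objects
# per request; that per-request work is the server-side cost described above.
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

REPO = "/srv/git/myrepo.git"  # hypothetical bare repository


def pkt_line(data: bytes) -> bytes:
    # git pkt-line framing: 4 hex digits of (payload length + 4), then payload
    return f"{len(data) + 4:04x}".encode() + data


class GitSmartHTTP(BaseHTTPRequestHandler):
    def do_GET(self):
        # GET /info/refs?service=git-upload-pack -> ref advertisement
        if not self.path.startswith("/info/refs"):
            self.send_error(404)
            return
        refs = subprocess.run(
            ["git", "upload-pack", "--stateless-rpc", "--advertise-refs", REPO],
            capture_output=True, check=True,
        ).stdout
        self.send_response(200)
        self.send_header(
            "Content-Type", "application/x-git-upload-pack-advertisement")
        self.end_headers()
        self.wfile.write(
            pkt_line(b"# service=git-upload-pack\n") + b"0000" + refs)

    def do_POST(self):
        # POST /git-upload-pack -> want/have negotiation and pack generation;
        # this is where the CPU and I/O go on every clone or fetch.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = subprocess.run(
            ["git", "upload-pack", "--stateless-rpc", REPO],
            input=body, capture_output=True, check=True,
        ).stdout
        self.send_response(200)
        self.send_header("Content-Type", "application/x-git-upload-pack-result")
        self.end_headers()
        self.wfile.write(result)


if __name__ == "__main__":
    HTTPServer(("", 8000), GitSmartHTTP).serve_forever()
```

Production servers (git http-backend behind a real web server, Gitaly, and so on) layer things like pack caching, reachability bitmaps, and concurrency limits on top of essentially this flow.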
These days, people solve similar problems by wrapping their data in an OCI container image and distributing it through one of the container registries that don't have a practically meaningful pull rate limit. Not really a joke, unfortunately.
Even Amazon encourages this, probably not intentionally, more as a band-aid for a bad EKS config that people can end up with by mistake, but still - you can pull 5 terabytes from ECR for free under their free tier each month.
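For what it's worth, a hedged sketch of that trick (the registry address, tag, and artifact name below are hypothetical placeholders): pack the data into a single-layer scratch image and push it, so CI jobs pull it through registry infrastructure instead of a git server.

```python
# Hedged sketch: wrap a data artifact in a single-layer OCI image and push it,
# so CI pulls it from a registry instead of a git server. Registry address,
# tag, and artifact name are hypothetical placeholders.
import pathlib, shutil, subprocess, tempfile

DATA = pathlib.Path("repo-snapshot.tar.gz")   # hypothetical pre-built artifact
IMAGE = "registry.example.com/snapshots:v1"   # hypothetical registry/repo:tag

with tempfile.TemporaryDirectory() as tmp:
    ctx = pathlib.Path(tmp)
    shutil.copy(DATA, ctx / DATA.name)
    # FROM scratch + COPY keeps the tarball as a plain file inside the image
    # (ADD would auto-extract it); the image is never meant to be run.
    (ctx / "Dockerfile").write_text(
        f"FROM scratch\nCOPY {DATA.name} /{DATA.name}\n")
    subprocess.run(["docker", "build", "-t", IMAGE, str(ctx)], check=True)
    subprocess.run(["docker", "push", IMAGE], check=True)
```

On the consuming side the file can be extracted without running anything, e.g. docker create with a dummy command followed by docker cp; tools like ORAS do this kind of arbitrary-artifact push/pull more directly.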
Is running the git binary as a read-only nginx backend not good enough? Probably not. Hosting tarballs is far more efficient.
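If "hosting tarballs" means pre-generated snapshots, a sketch like this (paths and branch name are hypothetical) shows where the saving comes from: the object walk and compression happen once per ref update via git archive, and nginx or any static file host then just streams bytes.

```python
# Sketch: regenerate a tarball snapshot once per ref update, then let nginx,
# S3, or a CDN serve it as a static file. Paths and branch name hypothetical.
import subprocess

REPO = "/srv/git/myrepo.git"
OUT = "/srv/www/snapshots/myrepo-main.tar.gz"

# `git archive` walks and compresses the tree once here, instead of the server
# doing pack negotiation and compression on every single download.
subprocess.run(
    ["git", "--git-dir", REPO, "archive", "--format=tar.gz", "-o", OUT, "main"],
    check=True,
)
```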
FTFY:
Explain to me how you self-host a git repo, without spending any money and with no budget, which is accessed millions of times a day by CI jobs pulling packages.
You git init --bare on a host with sufficient resources. But I would recommend thinking about your CI flow too.
No, hundreds of thousands of individual projects' CI jobs. OP was talking about package managers for the whole world, not for one company.
They probably would have experienced issues way sooner, as self-hosted tools don't scale nearly as well.