Comment by machinationu

1 day ago

Explain to me how you self-host a git repo which is accessed millions of times a day from CI jobs pulling packages.

I'm not sure whether this question was asked in good faith, but it is actually a damn good one.

I've looked into self-hosting a git repo with horizontal scalability, and it is indeed very difficult. I don't have time to detail it in a comment here, but for anyone who is curious, it's very informative to look at how GitLab handled this with Gitaly. I've also seen some clever attempts to use object storage, though I haven't seen any of those solutions put heavily to the test.

I'd love to hear from others about ideas and approaches they've heard about or tried

https://gitlab.com/gitlab-org/gitaly

Let's assume 3 million a day. That's roughly 35 per second.

From a compute POV you can serve that with one server or virtual machine.

Bandwidth-wise, given a 100 MB repo size, that works out to about 3.5 GB/s - also easy terrain for a single server.
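
For anyone who wants to check the arithmetic, here's a quick back-of-envelope in Python (the 3 million/day and 100 MB figures are just the assumptions from this comment):

  # Back-of-envelope check of the request rate and bandwidth.
  # 3M requests/day and a 100 MB repo are the assumptions above.
  requests_per_day = 3_000_000
  repo_size_mb = 100

  requests_per_second = requests_per_day / 86_400            # seconds in a day
  bandwidth_gb_per_s = requests_per_second * repo_size_mb / 1000

  print(f"{requests_per_second:.1f} requests/s")              # ~34.7
  print(f"{bandwidth_gb_per_s:.2f} GB/s of full clones")       # ~3.47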

  • That is roughly the number of new requests per second, but these are not just light web requests.

    The git transport protocol is "smart" in a way that is arguably rather dumb. It's certainly expensive on the server side. All of its smartness is aimed at reducing the amount of transfer and the number of connections, but to do that it shifts a considerable amount of work onto the server, which has to choose which objects to send you.

    If you benchmark the resource loads of this, you probably won't be saying a single server is such an easy win :)

    • Here's a source from 5 years ago on how much CPU time it takes: https://github.blog/open-source/git/git-clone-a-data-driven-...

      Using the slowest clone method, they measured 8 s for a 750 MB repo and 0.45 s for a 40 MB repo. That appears to be roughly linear, so ~1.1 s for 100 MB should be a valid interpolation.

      So doing ~35 of those per second only takes about 38 cores (rough arithmetic sketched below). Servers have hundreds of cores now (e.g. 384 cores: https://www.phoronix.com/review/amd-epyc-9965-linux-619).

      And remember we're using worst-case assumptions in places (the slowest clone method, and numbers from old hardware). In practice I'd bet a fastish laptop would suffice.

      edit: actually, on closer look at the GitHub-reported numbers, the interpolation isn't straightforward: on the bigger 750 MB repo the partial clone is actually said to be slower than the base full clone. However, this doesn't change the big picture that it'll easily fit on one server.
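
      In Python, roughly (the clone times are the blog's measurements; the linear fit and the ~35 clones/s rate are assumptions carried over from this thread):

        # Rough core-count estimate from the GitHub blog numbers quoted above.
        # The linear fit and the ~35 clones/s rate are assumptions, not measurements.
        sec_per_mb = (8.0 - 0.45) / (750 - 40)                 # slope between the two data points
        clone_time_100mb = 0.45 + (100 - 40) * sec_per_mb      # ~1.1 s of CPU per clone

        clones_per_second = 3_000_000 / 86_400                 # ~35
        cores_needed = clones_per_second * clone_time_100mb    # ~38 cores

        print(f"{clone_time_100mb:.2f} s per clone, ~{cores_needed:.0f} cores")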


These days, people solve similar problems by wrapping their data in an OCI container image and distributing it through one of the container registries that don't have a practically meaningful pull rate limit. Not really a joke, unfortunately.
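
For illustration only, a minimal sketch of that pattern, assuming Docker is installed; the registry tag and file name are made up:

  # Hypothetical sketch: wrap an artifact in a scratch OCI image and push it.
  # The registry/tag and file name are placeholders, not real infrastructure.
  import pathlib
  import subprocess

  TAG = "registry.example.com/myorg/package-cache:2024-06"    # placeholder
  blob = pathlib.Path("packages.tar.gz")                       # placeholder artifact

  # A one-layer image containing nothing but the artifact.
  (blob.parent / "Dockerfile").write_text(f"FROM scratch\nCOPY {blob.name} /{blob.name}\n")
  subprocess.run(["docker", "build", "-t", TAG, str(blob.parent)], check=True)
  subprocess.run(["docker", "push", TAG], check=True)

  # Consumers get the file back out without running anything, e.g.:
  #   docker create --name tmp <tag> placeholder
  #   docker cp tmp:/packages.tar.gz . && docker rm tmp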

  • Even Amazon encourages this - probably not intentionally, more as a band-aid for bad EKS configs that people can create by mistake - but still: you can pull 5 terabytes from ECR for free each month under their free tier.

    • I'd say it's just that Kubernetes in general should've shipped with a storage engine and an installation mechanism.

      RKE2 does have a distributed internal registry, but it feels like a very hacky add-on that only works if you enable it and use it in a very specific way.

      For how much people love just shipping a Helm chart, it's actually absurdly hard to ship a self-contained installation that doesn't try to hit internet resources.

FTFY:

Explain to me how you self-host a git repo, without spending any money and with no budget, which is accessed millions of times a day from CI jobs pulling packages.

Is running the git binary as a read-only nginx backend not good enough? Probably not. Hosting tarballs is far more efficient.

You git init --bare on a host with sufficient resources. But I would recommend thinking about your CI flow too.
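
As a minimal (and deliberately naive) sketch of that, using git's dumb HTTP protocol rather than the smart protocol behind nginx mentioned above: after git init --bare and git update-server-info, a bare repo can be cloned from any static file server. Paths and port below are placeholders.

  # Serve a bare repo read-only over git's "dumb" HTTP protocol.
  # Assumes /srv/git/myrepo.git was created with `git init --bare` (placeholder path).
  import subprocess
  from functools import partial
  from http.server import HTTPServer, SimpleHTTPRequestHandler

  REPO = "/srv/git/myrepo.git"

  # Regenerate info/refs and objects/info/packs; must be re-run after every push.
  subprocess.run(["git", "-C", REPO, "update-server-info"], check=True)

  # Clients can now `git clone http://<host>:8000/myrepo.git`.
  handler = partial(SimpleHTTPRequestHandler, directory="/srv/git")
  HTTPServer(("", 8000), handler).serve_forever()

That obviously won't hold up to 35 full clones a second on its own, which is the crux of the thread.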

  • no, hundreds of thousands of individual projects' CI jobs. OP was talking about package managers for the whole world, not for one company.

    • If people depend on remote downloads from other companies for their CI pipelines, they're doing it wrong. Every sensible company sets up a mirror, or at least a cache, on infra that they control. Rate-limiting downloads is the natural course of action for the provider of a package registry. Once you have so many unique users that even civilized use of your infrastructure becomes too much, you can probably hire a few people to build something more scalable.
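
      A sketch of what that mirror/cache can look like (upstream URL and local path are placeholders; a cron job works just as well as the loop):

        # Keep a local mirror fresh so CI jobs clone from your own infra,
        # not the upstream's. URL and path are placeholders.
        import os
        import subprocess
        import time

        UPSTREAM = "https://github.com/example/project.git"   # placeholder
        MIRROR = "/srv/mirrors/project.git"                    # placeholder

        if not os.path.isdir(MIRROR):
            subprocess.run(["git", "clone", "--mirror", UPSTREAM, MIRROR], check=True)

        while True:
            # --prune drops refs deleted upstream; point CI at MIRROR instead.
            subprocess.run(["git", "-C", MIRROR, "remote", "update", "--prune"], check=True)
            time.sleep(300)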
