
Comment by dboon

1 day ago

I’m building Cargo/UV for C. Good article. I thought about this problem very deeply.

Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world-class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lens.

Fundamentally, the issue is the sparse checkouts mentioned by the author. You’d really like to use git to version package manifests, so that anyone with any package version can get the EXACT package they built with.

But this doesn’t work, because you need arbitrary commits. You either need a full checkout, or you need to somehow record which commit a package version lives in, and you can’t know the hash git will generate before you make the commit. You’d have to push the package update and then push a second commit recording its hash. Obviously infeasible, obviously a nightmare.

Conan’s solution is, I think, just about the only way. It trades perfect reproduction for conditional logic in the manifest. Instead of 3.12 pointing to its own commit, every 3.x points to the same manifest, and there’s just a little logic to set the specific config field added in 3.12. If the logic gets to be too much, they let you map version ranges to different manifests for a package. So if 3.13 rewrites the entire manifest, you just remap it.
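To make that concrete, here is roughly the shape such a recipe takes: a single Python conanfile covering a whole version range, with a small conditional instead of one commit per version. The package name, the option, and the 3.12 cutoff are made up for illustration, not taken from any real recipe.

```python
# Sketch of a version-range manifest in the Conan style: one recipe serves
# every 3.x release, and per-version differences are handled with conditionals.
# The "foo" package, the with_zstd option, and the 3.12 cutoff are hypothetical.
from conan import ConanFile
from conan.tools.scm import Version


class FooConan(ConanFile):
    name = "foo"
    settings = "os", "arch", "compiler", "build_type"
    options = {"shared": [True, False], "with_zstd": [True, False]}
    default_options = {"shared": False, "with_zstd": True}

    def config_options(self):
        # The hypothetical with_zstd option only exists from 3.12 onward;
        # older 3.x versions share this exact recipe, so drop it for them.
        if Version(self.version) < "3.12":
            del self.options.with_zstd
```

When a new version changes too much to paper over with conditionals like this, the version-range-to-manifest remapping mentioned above takes over instead.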

I have not found another package manager that uses git as a backend and isn’t a terrible, slow tool. Conan may not be as rigorous as Nix because of this decision, but it is quite pragmatic and useful. The real solution is to use a database, of course, but unless someone wants to wire me ten thousand dollars plus server costs in perpetuity, what’s a guy supposed to do?

Think about the article from a different perspective: several of the most successful and widely used package managers of all time started out using Git, and they successfully transitioned to a more efficient solution when they needed to.

  • Not only this, but (if I understand the article correctly) at least some of them still use git on the backend.

How about the Arch Linux AUR approach?

Every package has its own git repository, which for binary packages mostly contains just the manifest. Sources and assets, if they're in git at all, are usually in separate repos.

This seems to avoid the issues in the examples given so far, which come from using "monorepos" or colocating assets with manifests. It also avoids the "nightmare" you mention, since any references would live in separate repos.

The problematic examples either have their assets and manifests colocated, or use a monorepo approach (colocating manifests and the global index).

The alluring thing is storing the repository on S3 (or similar). Recall how early Docker registries made their requests so complicated that backing image storage with S3 was infeasible without a proxy service.

The thing that scales is dumb HTTP that can be backed by something like S3.

You don't have to use a cloud; just go with a single big server. And if you become popular, find a sponsor and move to the cloud.

If money and sponsor independence are a huge concern, the alternative would be peer-to-peer.

I haven't seen many package managers do it, but it feels like a huge missed opportunity. You don't need that many volunteers to peer in order to have a lot of bandwidth available.

Granted, the real problem that'll drive up hosting costs is CI, or rather careless CI without caching. Unless you require a user login, or limit downloads for IPs without one, caching is hard to enforce.

For popular package repositories you'll likely see extremely degenerate CI systems eating bandwidth as if it was free.

Disclaimer: opinions are my own.

Before you've managed to build a popular tool, it's unlikely that you need to serve many users. Going straight for something that can serve the world is probably premature.

  • For most software, yes. But the value of a package manager is in its adoption. A package manager that doesn’t run up against these problems is probably a failure anyway.

  • The point is not "design to serve the world". The point is "use the right technology for your problem space".

Is there a reason the users must see all of the historic data too? Why not just have a post-commit hook render the current HEAD to static files, served from something like GitHub Pages?

That can be moved elsewhere / mirrored later if needed, of course. And the underlying data is still in git, just not actively used for the API calls.
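For what it's worth, a minimal sketch of what such a post-commit hook could do, assuming a layout of one JSON manifest per package version under packages/<name>/<version>.json; the paths and layout are assumptions, not something from the article:

```python
#!/usr/bin/env python3
# Post-commit hook sketch: render the registry's current HEAD into static
# JSON files that a dumb HTTP host (e.g. GitHub Pages) can serve as-is.
# The packages/ layout and the public/ output directory are assumptions.
import json
import pathlib
import subprocess

OUT = pathlib.Path("public")          # directory published as static files
MANIFESTS = pathlib.Path("packages")  # one manifest file per package version


def main() -> None:
    head = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()

    index: dict[str, list[str]] = {}
    for manifest in MANIFESTS.glob("*/*.json"):   # packages/<name>/<version>.json
        name, version = manifest.parent.name, manifest.stem
        index.setdefault(name, []).append(version)
        target = OUT / name / f"{version}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(manifest.read_text())

    # Top-level index recording which commit this snapshot was rendered from.
    (OUT / "index.json").write_text(json.dumps({"commit": head, "packages": index}))


if __name__ == "__main__":
    main()
```

The static host then serves public/ directly, and the full history stays available in git for anyone who wants to dig through it.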

It might also be interesting to look at what Linux distros do, like Debian (salsa), Fedora (Pagure), and openSUSE (OBS). They're a good reference here because their historic model is free mirrors hosted by volunteers, precisely because the projects themselves don't have the compute resources.

  • I'm not OP but I'll guess .... lock files with old versions of libs in. The latest version of a library may be v2 but if most users are locked to v1.267.34 you need all the old versions too.

    However a lot of the "data in git repositories" projects I see don't have any such need, and then ...

    > Why not just have a post-commit hook render the current HEAD to static files, into something like GitHub Pages?

    ... is a good plan. Usually they make a nice static website with the data that's easy for humans to read though.

> Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world-class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lens.

So you need a decentralized database? Those exist (or you can make your own, if you're feeling ambitious), including ones that scale in different ways than git does.

  • Please share. I’m interested in anything that’s roughly as simple as implementing a centralized registry, is easily inspected by users (preferably with no external tooling), and is very fast.

    It’s really important that someone is able to look up the manifest one of their dependencies uses, for when stuff doesn’t work out of the box. That should be as simple as possible.

    I’m all ears, though! Would love to find something as simple and good as a git registry but decentralized

    • You don't need a fully distributed database, do you?

      You could just make a registry hosted as plain HTTP, with everything signed. And a special file that contains a list of mirrors.

      Clients request the mirror list and the signed hash of the last entry in the Merkle tree. Then they go talk to a random mirror (roughly the client flow sketched after this thread).

      Maybe your central service requires user sign-in for publishing and reading, while mirrors can't publish but don't require sign-in.

      Obviously, you'd have to validate that mirrors are up and populated. But that's it.

      You can start by self hosting a mirror.

      One could go with signing schemes inspired by: https://theupdateframework.io/

      Or one could omit signing altogether, so long as you have a Merkle tree with hashes for all publishing events, and the latest hash entry is always fetched from your server along with the mirror list.

      Having all publishing go through a single service is probably desirable. You'll eventually need to do moderation, etc. And hosting your service or a mirror becomes a legal nightmare if there is no moderation.

      Disclaimer: opinions are my own.

    • Package registry in an SQLite database, snapshotted daily. Stored in a cloud bucket. New clients download the latest snapshot, existing clients stream in the updates using e.g. Litestream. Resolving dependencies should now be ultra fast thanks to indexes (a small sketch of this also follows below).
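A rough sketch of the client side of the mirror-list-plus-signed-head scheme described a couple of comments up. The URLs, the meta.json layout, and the log format are all assumptions, and it checks only the latest log entry rather than a full Merkle proof:

```python
# Client side of the mirror scheme: one small request to the central service
# for the mirror list and the head of the publish log, then bulk downloads
# from a random mirror, checked against that head.
# CENTRAL, meta.json, and the log layout are hypothetical.
import hashlib
import json
import random
import urllib.request

CENTRAL = "https://registry.example.org"  # hypothetical central service


def fetch_json(url: str) -> dict:
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def main() -> None:
    # Cheap request to the central service: mirror list + latest log head.
    meta = fetch_json(f"{CENTRAL}/meta.json")
    mirrors, head = meta["mirrors"], meta["log_head"]

    # Bulk traffic goes to a random mirror, not the central server.
    mirror = random.choice(mirrors)
    entry = fetch_json(f"{mirror}/log/latest.json")

    # Check the mirror's copy against the head the central service vouched for.
    # (A real implementation would verify a signature and a Merkle proof here.)
    digest = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    if digest != head:
        raise SystemExit(f"mirror {mirror} is stale or tampered with")
    print("mirror verified; latest publish:", entry.get("package"))


if __name__ == "__main__":
    main()
```

The point is that the only request the central service answers per client is the tiny meta fetch; all the heavy traffic lands on mirrors.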
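And a minimal sketch of the sibling suggestion of shipping the registry as an SQLite snapshot in a bucket. The bucket URL and the packages(name, version) schema are assumptions; an index on the name column is assumed to ship inside the snapshot:

```python
# Sketch of "registry as an SQLite snapshot": clients download the snapshot
# once, then resolve versions locally with ordinary indexed queries.
# SNAPSHOT_URL and the packages(name, version) table are hypothetical.
import sqlite3
import urllib.request

SNAPSHOT_URL = "https://bucket.example.com/registry-latest.sqlite3"


def ensure_snapshot(path: str = "registry.sqlite3") -> str:
    # In practice you would only re-download when stale, or stream deltas
    # with something like Litestream; this just grabs the latest snapshot.
    urllib.request.urlretrieve(SNAPSHOT_URL, path)
    return path


def versions_of(db_path: str, name: str) -> list[str]:
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT version FROM packages WHERE name = ? ORDER BY version DESC",
        (name,),
    ).fetchall()
    con.close()
    return [v for (v,) in rows]
```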

> I’m building Cargo/UV for C.

Interesting! Do you mind sharing a link to the project at this point?