
Comment by dboon

1 day ago

I’m building Cargo/UV for C. Good article. I thought about this problem very deeply.

Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world-class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lens.

Fundamentally, the issue is the sparse checkouts mentioned by the author. You’d really like to use git to version package manifests, so that anyone with any package version can get the EXACT package they built with.

But this doesn’t work, because you need arbitrary commits. You either need a full checkout, or you need to somehow record which commit a package version lives in, and you can’t know the hash git will generate before you make the commit. You’d have to push the package update and then push a second commit recording its hash. Obviously infeasible, obviously a nightmare.

Conan’s solution is, I think, just about the only way. It trades perfect reproduction for conditional logic in the manifest. Instead of 3.12 pointing to its own commit, every 3.x points to the same manifest, and there’s just a little logic to set the specific config field added in 3.12. If the logic gets to be too much, they let you map version ranges to different manifests for a package. So if 3.13 rewrites the entire manifest, you just remap it.
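To make that concrete, here is roughly the shape such a recipe takes: a single Python conanfile covering a whole version range, with a small conditional instead of one commit per version. The package name, the option, and the 3.12 cutoff are made up for illustration, not taken from any real recipe.

```python
# Sketch of a version-range manifest in the Conan style: one recipe serves
# every 3.x release, and per-version differences are handled with conditionals.
# The "foo" package, the with_zstd option, and the 3.12 cutoff are hypothetical.
from conan import ConanFile
from conan.tools.scm import Version


class FooConan(ConanFile):
    name = "foo"
    settings = "os", "arch", "compiler", "build_type"
    options = {"shared": [True, False], "with_zstd": [True, False]}
    default_options = {"shared": False, "with_zstd": True}

    def config_options(self):
        # The hypothetical with_zstd option only exists from 3.12 onward;
        # older 3.x versions share this exact recipe, so drop it for them.
        if Version(self.version) < "3.12":
            del self.options.with_zstd
```

When a new version changes too much to paper over with conditionals like this, the version-range-to-manifest remapping mentioned above takes over instead.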

I have not found another package manager that uses git as a backend and isn’t a terrible, slow tool. Conan may not be as rigorous as Nix because of this decision, but it is quite pragmatic and useful. The real solution is to use a database, of course, but unless someone wants to wire me ten thousand dollars plus server costs in perpetuity, what’s a guy supposed to do?

Think about the article from a different perspective: several of the most successful and widely used package managers of all time started out using Git, and they successfully transitioned to a more efficient solution when they needed to.

  • Not only this, but (if I understand the article correctly) at least some of them still use git on the backend.

How about the Arch Linux AUR approach?

Every package has its own git repository, which for binary packages mostly contains just the manifest. Sources and assets, if they're in git at all, are usually in separate repos.

This seems to avoid the issues in the examples given so far, which come from using "monorepos" or colocating assets with manifests. It also avoids the "nightmare" you mention, since any references would live in separate repos.

The problematic examples either have their assets and manifests colocated, or use a monorepo approach (colocating manifests and the global index).

The alluring thing is storing the repository on S3 (or similar). Recall how early Docker registries made their requests so complicated that backing image storage with S3 was infeasible without a proxy service.

The thing that scales is dumb HTTP that can be backed by something like S3.

You don't have to use a cloud; just go with a single big server. And if you become popular, find a sponsor and move to the cloud.

If money and sponsor independence are a huge concern, the alternative would be peer-to-peer.

I haven't seen many package managers do it, but it feels like a huge missed opportunity. You don't need that many volunteers to peer in order to have a lot of bandwidth available.

Granted, the real problem that'll drive up hosting costs is CI, or rather careless CI without caching. Unless you require a user login, or limit downloads for IPs without one, caching is hard to enforce.

For popular package repositories you'll likely see extremely degenerate CI systems eating bandwidth as if it was free.

Disclaimer: opinions are my own.

Before you've managed to build a popular tool, it's unlikely that you need to serve many users. Going straight for something that can serve the world is probably premature.

  • For most software, yes. But the value of a package manager is in its adoption. A package manager that doesn’t run up against these problems is probably a failure anyway.

  • The point is not "design to serve the world". The point is "use the right technology for your problem space".

Is there a reason the users must see all of the historic data too? Why not just have a post-commit hook render the current HEAD to static files, served from something like GitHub Pages?

That can be moved elsewhere / mirrored later if needed, of course. And the underlying data is still in git, just not actively used for the API calls.
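For what it's worth, a minimal sketch of what such a post-commit hook could do, assuming a layout of one JSON manifest per package version under packages/<name>/<version>.json; the paths and layout are assumptions, not something from the article:

```python
#!/usr/bin/env python3
# Post-commit hook sketch: render the registry's current HEAD into static
# JSON files that a dumb HTTP host (e.g. GitHub Pages) can serve as-is.
# The packages/ layout and the public/ output directory are assumptions.
import json
import pathlib
import subprocess

OUT = pathlib.Path("public")          # directory published as static files
MANIFESTS = pathlib.Path("packages")  # one manifest file per package version


def main() -> None:
    head = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()

    index: dict[str, list[str]] = {}
    for manifest in MANIFESTS.glob("*/*.json"):   # packages/<name>/<version>.json
        name, version = manifest.parent.name, manifest.stem
        index.setdefault(name, []).append(version)
        target = OUT / name / f"{version}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(manifest.read_text())

    # Top-level index recording which commit this snapshot was rendered from.
    (OUT / "index.json").write_text(json.dumps({"commit": head, "packages": index}))


if __name__ == "__main__":
    main()
```

The static host then serves public/ directly, and the full history stays available in git for anyone who wants to dig through it.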

It might also be interesting to look at what Linux distros do, like Debian (salsa), Fedora (Pagure), and openSUSE (OBS). They're a good reference here because their historic model is free mirrors hosted by volunteers, precisely because the projects themselves don't have the compute resources.

  • I'm not OP but I'll guess .... lock files with old versions of libs in. The latest version of a library may be v2 but if most users are locked to v1.267.34 you need all the old versions too.

    However a lot of the "data in git repositories" projects I see don't have any such need, and then ...

    > Why not just have a post-commit hook render the current HEAD to static files, into something like GitHub Pages?

    ... is a good plan. Usually they make a nice static website with the data that's easy for humans to read though.

> Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world-class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lens.

So you need a decentralized database? Those exist (or you can make your own, if you're feeling ambitious), including ones that scale in different ways than git does.

  • Please share. I’m interested in anything that’s roughly as simple as implementing a centralized registry, is easily inspected by users (preferably with no external tooling), and is very fast.

    It’s really important that someone is able to look up the manifest one of their dependencies uses, for when stuff doesn’t work out of the box. That should be as simple as possible.

    I’m all ears, though! Would love to find something as simple and good as a git registry but decentralized

    • You don't need a fully distributed database, do you?

      You could just make a registry hosted as plain HTTP, with everything signed. And a special file that contains a list of mirrors.

      Clients request the mirror list and the signed hash of the last entry in the Merkle tree. Then they go talk to a random mirror (roughly the client flow sketched after this thread).

      Maybe your central service requires user sign-in for publishing and reading, while mirrors can't publish but don't require sign-in.

      Obviously, you'd have to validate that mirrors are up and populated. But that's it.

      You can start by self hosting a mirror.

      One could go with signing schemes inspired by: https://theupdateframework.io/

      Or one could omit signing altogether, so long as you have a Merkle tree with hashes for all publishing events, and the latest hash entry is always fetched from your server along with the mirror list.

      Having all publishing go through a single service is probably desirable. You'll eventually need to do moderation, etc. And hosting your service or a mirror becomes a legal nightmare if there is no moderation.

      Disclaimer: opinions are my own.

    • Package registry in an SQLite database, snapshotted daily. Stored in a cloud bucket. New clients download the latest snapshot, existing clients stream in the updates using e.g. Litestream. Resolving dependencies should now be ultra fast thanks to indexes (a small sketch of this also follows below).
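A rough sketch of the client side of the mirror-list-plus-signed-head scheme described a couple of comments up. The URLs, the meta.json layout, and the log format are all assumptions, and it checks only the latest log entry rather than a full Merkle proof:

```python
# Client side of the mirror scheme: one small request to the central service
# for the mirror list and the head of the publish log, then bulk downloads
# from a random mirror, checked against that head.
# CENTRAL, meta.json, and the log layout are hypothetical.
import hashlib
import json
import random
import urllib.request

CENTRAL = "https://registry.example.org"  # hypothetical central service


def fetch_json(url: str) -> dict:
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def main() -> None:
    # Cheap request to the central service: mirror list + latest log head.
    meta = fetch_json(f"{CENTRAL}/meta.json")
    mirrors, head = meta["mirrors"], meta["log_head"]

    # Bulk traffic goes to a random mirror, not the central server.
    mirror = random.choice(mirrors)
    entry = fetch_json(f"{mirror}/log/latest.json")

    # Check the mirror's copy against the head the central service vouched for.
    # (A real implementation would verify a signature and a Merkle proof here.)
    digest = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    if digest != head:
        raise SystemExit(f"mirror {mirror} is stale or tampered with")
    print("mirror verified; latest publish:", entry.get("package"))


if __name__ == "__main__":
    main()
```

The point is that the only request the central service answers per client is the tiny meta fetch; all the heavy traffic lands on mirrors.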
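And a minimal sketch of the sibling suggestion of shipping the registry as an SQLite snapshot in a bucket. The bucket URL and the packages(name, version) schema are assumptions; an index on the name column is assumed to ship inside the snapshot:

```python
# Sketch of "registry as an SQLite snapshot": clients download the snapshot
# once, then resolve versions locally with ordinary indexed queries.
# SNAPSHOT_URL and the packages(name, version) table are hypothetical.
import sqlite3
import urllib.request

SNAPSHOT_URL = "https://bucket.example.com/registry-latest.sqlite3"


def ensure_snapshot(path: str = "registry.sqlite3") -> str:
    # In practice you would only re-download when stale, or stream deltas
    # with something like Litestream; this just grabs the latest snapshot.
    urllib.request.urlretrieve(SNAPSHOT_URL, path)
    return path


def versions_of(db_path: str, name: str) -> list[str]:
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT version FROM packages WHERE name = ? ORDER BY version DESC",
        (name,),
    ).fetchall()
    con.close()
    return [v for (v,) in rows]
```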

> I’m building Cargo/UV for C.

Interesting! Do you mind sharing a link to the project at this point?