Package managers keep using Git as a database, it never works out

1 month ago (nesbitt.io)

482 comments

birdculture

This seems like a tragedy of the commons -- GitHub is free after all, and it has all of these great properties, so why not? -- but this kind of decision making occurs whenever externalities are present.

My favorite hill to die on (externality) is user time. Most software houses spend so much time focusing on how expensive engineering time is that they neglect user time. Software houses optimize for feature delivery and not user interaction time. Yet if I spent one hour making my app one second faster for my million users, I can save 277 user hour per year. But since user hours are an externality, such optimization never gets done.

Externalities lead to users downloading extra gigabytes of data (wasted time) and waiting for software, all of which is waste that the developer isn't responsible for and doesn't care about.

Aurornis 1 month ago
> Most software houses spend so much time focusing on how expensive engineering time is that they neglect user time. Software houses optimize for feature delivery and not user interaction time.
I don’t know what you mean by software houses, but every consumer facing software product I’ve worked on has tracked things like startup time and latency for common operations as a key metric
This has been common wisdom for decades. I don’t know how many times I’ve heard the repeated quote about how Amazon loses $X million for every Y milliseconds of page loading time, as an example.
- rovr138 1 month ago
  
  There was a thread here earlier this month,
  > Helldivers 2 devs slash install size from 154GB to 23GB
  https://news.ycombinator.com/item?id=46134178
  Section of the top comment says,
  > It seems bizarre to me that they'd have accepted such a high cost (150GB+ installation size!) without entirely verifying that it was necessary!
  and the reply to it has,
  > They’re not the ones bearing the cost. Customers are.
  
  23 replies →
- dijit 1 month ago
  
  I worked in e-commerce SaaS in 2011~ and this was true then but I find it less true these days.
  Are you sure that you’re not the driving force behind those metrics; or that you’re not self-selecting for like-minded individuals?
  I find it really difficult to convince myself that even large players (Discord) are measuring startup time. Every time I start the thing I’m greeted by a 25s wait and a `RAND()%9` number of updates that each take about 5-10s.
  
  19 replies →
- ponector 1 month ago
  
  Contrary, every consumer facing product I've worked had no performance metrics tracked. And for enterprise software it was even worse as the end user is not the one who makes a decision to buy and use software.
  >>what you mean by software houses
  How about Microsoft? Start menu is a slow electron app.
  
  15 replies →
- ponector 1 month ago
  
  >> I don’t know what you mean by software houses, but every consumer facing software product I’ve worked on has tracked things like startup time and latency for common operations as a key metric
  Maybe Google? Gmail app is 700+ MB
- moregrist 1 month ago
  
  > I don’t know how many times I’ve heard the repeated quote about how Amazon loses $X million for every Y milliseconds of page loading time, as an example.
  This is true for sites that are trying to make sales. You can quantify how much a delay affects closing a sale.
  For other apps, it’s less clear. During its high-growth years, MS Office had an abysmally long startup time.
  Maybe this was due to MS having a locked-in base of enterprise users. But given that OpenOffice and LibreOffice effectively duplicated long startup times, I don’t think it’s just that.
  You also see the Adobe suite (and also tools like GIMP) with some excruciatingly long startup times.
  I think it’s very likely that startup times of office apps have very little impact on whether users will buy the software.
  
  2 replies →
- xp84 1 month ago
  
  > every consumer facing software product I’ve worked on has tracked things like startup time and latency for common operations as a key metric
  Must be nice. In my career, all working on webapps, I've seen a few leaders popping in to ask us to fix a particularly egregious performance issue if the right customers complain, but aside from those finely-targeted and limited-attention-span drives to "improve performance" it seems the answer for the past decade or so is just to assume everyone is on at least a gigabit connection, stick fingers in ears, and just keep adding more node modules. If the developers' disks get full because node_modules got too big, buy a bigger SSD and keep going. (ok that last part is slight hyperbole but I also don't think frontend devs would be deterred from their ravenous appetite for libraries by a full disk).
- j_w 1 month ago
  
  Clearly Amazon doesn't care about that sentiment across the board. Plenty of their products are absurdly slow because of their poor engineering.
- Yoric 1 month ago
  
  Can confirm at least for Firefox. When I worked on it, I've spent literal years shaving seconds from startup, or shutdown, or milliseconds from tab switching.
  Everybody likes to hate Telemetry, and yes, it can be abused, but that's how Mozilla (and its competitors) manage to make user's life more comfortable.
- mindslight 1 month ago
  
  > every consumer facing software product I’ve worked on has tracked things like startup time and latency for common operations as a key metric
  Are they evaluating the shape of that line with the same goal as the stonk score? Time spent by users is an "engagement" metric, right?
- eviks 1 month ago
  
  The issue here is not tracking, but developing. Like, how do you explain the fact that whole classes of software have gotten worse on those "key metrics"? (and that includes web-selling webpages)
- croes 1 month ago
  
  Then why do many software house favor cloud software over on premise?
  They often have a recognizable delay to user data input compared to local software
  
  1 reply →
- venturecruelty 1 month ago
  
  >I don’t know what you mean by software houses, but every consumer facing software product I’ve worked on has tracked things like startup time and latency for common operations as a key metric.
  Then respectfully, uh, why is basically all proprietary software slow as ass?
- pjmlp 1 month ago
  
  An exception that confirms the rule.
ekjhgkejhgk 1 month ago
I wouldn't call it tragedy of the commons, because it's not a commons. It's owned by microsoft. They're calculating that it's worth it for them, so I say take as much as you can.
Commons would be if it's owned by nobody and everyone benefits from its existence.
- dahart 1 month ago
  
  > so I say take as much as you can. Commons would be if it’s owned by nobody
  This isn’t what “commons” means in the term ‘tragedy of the commons’, and the obvious end result of your suggestion to take as much as you can is to cause the loss of access.
  Anything that is free to use is a commons, regardless of ownership, and when some people use too much, everyone loses access.
  Finite digital resources like bandwidth and database sizes within companies are even listed as examples in the Wikipedia article on Tragedy of the Commons. https://en.wikipedia.org/wiki/Tragedy_of_the_commons
  
  22 replies →
- TeMPOraL 1 month ago
  
  Still, because reality doesn't respect boundaries of human-made categories, and because people never define their categories exhaustively, we can safely assume that something almost-but-not-quite like a commons, is subject to an almost-but-not-quite tragedy of the commons.
  
  34 replies →
- jasonkester 1 month ago
  
  It has the same effect though. A few bad actors using this “free” thing can end up driving the cost up enough that Microsoft will have to start charging for it.
  The jerks get their free things for a while, then it goes away for everyone.
  
  6 replies →
- groundzeros2015 1 month ago
  
  A public park suffers from tragedy of the commons even though it’s managed by the city.
- drob518 1 month ago
  
  Right. Microsoft could easily impose a transfer fee if over a certain amount that would allow “normal” OSS development of even popular software to happen without charge while imposing a cost to projects that try to use GitHub like a database.
- TUSF 1 month ago
  
  I wouldn't call it "tragedy of the commons" because the very idea was coined as a strawman. As far as I'm concerned, the entire concept is a fallacy, and people should stop perpetuating it.
- rvba 1 month ago
  
  I doubt anyone is calculating
  Remember how GTA5 took 10 minutes to start and nobody cared? Lots of software is like this.
  Some Blizzard games download 137 MB file every time you run them and take few minutes to start (and no, this is not due to my computer).
- PunchyHamster 1 month ago
  
  Well, till you choose to host something yourself and it becomes popular
- ericyd 1 month ago
  
  Tragedy of the Microsoft just doesn't sound as nice though
solatic 1 month ago
If you think too hard about this, you come back around to Alan Kay's quote about how people who are really serious about software should build their own hardware. Web applications, and in general loading pretty much anything over the network, is a horrible, no-good, really bad user experience, and it always will be. The only way to really respect the user is with native applications that are local-first, and if you take that really far, you build (at the very least) peripherals to make it even better.
The number of companies that have this much respect for the user is vanishingly small.
- phkahler 1 month ago
  
  >> The number of companies that have this much respect for the user is vanishingly small.
  I think companies shifted to online apps because #1 it solved the copy protection problem. FOSS apps are not in any hurry to become centralized because they dont care about that issue.
  Local apps and data are a huge benefit of FOSS and I think every app website should at least mention that.
  "Local app. No ads. You own your data."
  
  1 reply →
- hombre_fatal 1 month ago
  
  Software I don’t have to install at all “respects me” the most.
  Native software being an optimum is mostly an engineer fantasy that comes from imagining what you can build.
  In reality that means having to install software like Meta’s WhatsApp, Zoom, and other crap I’d rather run in a browser tab.
  I want very little software running natively on my machine.
  
  6 replies →
- ghosty141 1 month ago
  
  Yes because users don't appreciate this enough to pay for the time this takes.
Y-bar 1 month ago
You’ll enjoy ”Saving Lives” by Andy Hertzfied: https://www.folklore.org/Saving_Lives.html
> "The Macintosh boots too slowly. You've got to make it faster!"
- kkjjjjw 1 month ago
  
  https://news.ycombinator.com/item?id=44843223#44879509
zahlman 1 month ago

> Most software houses spend so much time focusing on how expensive engineering time is that they neglect user time. Software houses optimize for feature delivery and not user interaction time. Yet if I spent one hour making my app one second faster for my million users, I can save 277 user hour per year. But since user hours are an externality, such optimization never gets done.
This is what people mean about speed being a feature. But "user time" depends on more than the program's performance. UI design is also very important.
bawolff 1 month ago
> Software houses optimize for feature delivery and not user interaction time. Yet if I spent one hour making my app one second faster for my million users, I can save 277 user hour per year. But since user hours are an externality, such optimization never gets done.
Google and amazon are famous for optimizing this. Its not an externality to them though, even 10s of ms can equal an extra sale.
That said, i don't think its fair to add time up like that. Saving 1 second for 600 people is not the same as saving 10 minutes for 1 person. Time in small increments does not have the same value as time in large increments.
- esafak 1 month ago
  
  1. If you can price the cost of the externality, you can justify optimizing it.
  2. Monopolies and situations with the principal/agent dilemma are less sensitive to such concerns.
  
  2 replies →
ozim 1 month ago
About apps done by software houses, even though we should strive for doing good job and I agree with sentiment...
First argument would be - take at least two 0's from your estimation, most of applications will have maybe thousands of users, successful ones will maybe run with 10's of thousands. You might get lucky to work on application that has 100's of thousands, millions of users and you work in FAANG not a typical "software house".
Second argument is - most users use 10-20 apps in typical workday, your application is most likely irrelevant.
Third argument is - most users would save much more time learning how to use applications (or to use computer) properly they use on daily basis, than someone optimizing some function from 2s to 1s. But of course that's hard because they have 10-20 apps daily plus god know how many other not on daily basis. Though still I see people doing super silly stuff in tools like Excel or even not knowing copy paste - so not even like any command line magic.
robmccoll 1 month ago

I don't think most software houses spend enough time even focusing on engineering time. CI pipelines that take tens of minutes to over an hour, compile times that exceed ten seconds when nothing has changed, startup times that are much more than a few seconds. Focus and fast iteration are super important to writing software and it seems like a lot of orgs just kinda shrug when these long waits creep into the development process.
3371 1 month ago
The user hour analogy sounds weird tho, 1s feels 1s regardless how many users you have. It's like the classic Asian teachers' logic of "if you come in 1 min late you are wasting N minutes for all of us in this class." It just does not stack like that.
- BenjiWiebe 1 month ago
  
  If the class takes N minutes and one person arrives 1 minute late, and the rest of the class is waiting for them, it does stack. Every one of those students lost a minute. Far worse than one student losing one minute.
  
  1 reply →
DrewADesign 1 month ago
> Yet if I spent one hour making my app one second faster for my million users, I can save 277 user hour per year. But since user hours are an externality, such optimization never gets done.
Wait times don’t accumulate. Depending on the software, to each individual user, that one second will probably make very little difference. Developers often overestimate the effect of performance optimization on user experience because it’s the aspect of user experience optimization their expertise most readily addresses. The company, generally, will have a much better ROI implementing well-designed features and having you squash bugs
- drbojingle 1 month ago
  
  A well designed feature IS considerate of time and attention. Why would I want a game on 20 fps when I could have it on 120? The smoothness of the experience increases my ability to use the experience optimally because I don't have to pay as much attention to it. I'd prefer if my interactions with machines were as smooth as my interactions driving a car down a empty dry highway mid day.
  Prehaps not everyone cares but I've played enough Age of Empires 2 to know that there are plenty of people who have felt value gains coming from shaving seconds off this and that to get compound games over time. It's a concept plenty of folks will be familiar with.
  
  3 replies →
pastor_williams 1 month ago

This was something that I heavily focused on for my feature area a year ago - new user sign up flow. But the decreased latency was really in pursuit of increased activation and conversion. At least the incentives aligned briefly.
inapis 1 month ago
>Yet if I spent one hour making my app one second faster for my million users, I can save 277 user hour per year. But since user hours are an externality, such optimization never gets done.
I have never been convinced by this argument. The aggregate number sounds fantastic but I don't believe that any meaningful work can be done by each user saving 1 second. That 1 second (and more) can simply be taken by me trying to stretch my body out.
OTOH, if the argument is to make software smaller, I can get behind that since it will simply lead to more efficient usage of existing resources and thus reduce the environmental impact.
But we live in a capitalist world and there needs to be external pressure for change to occur. The current RAM shortage, if it lasts, might be one of them. Otherwise, we're only day dreaming for a utopia.
- adrianN 1 month ago
  
  Time saved to increased productivity or happiness or whatever is not linear but a step function. Saving one second doesn’t help much, but there is a threshold (depending on the individual) where faster workflows lead to a better experience. It does make a difference whether a task takes a minute or half a second, at least for me.
- jorvi 1 month ago
  
  But there isn't just one company deciding externalizing cost on the rest of us is a great way to boost profit since it costs them very little. Especially for a monopoly like YouTube that can decide that eating up your battery is fine if it saves them a few cents in bandwidth costs.
  Not all of those externalizing companies abuse your time but whatever they abuse can be expressed in a $ amount and $ can be converted to a median's person time via median wage. Hell, free time is more valuable than whatever you produce during work.
  Say all that boils down to companies collectively stealing 20 minutes of your time each day. 140 minutes each week. 7280 (!) minutes each year, which is 5.05 days, which makes it almost a year over the course of 70 years.
  So yeah, don't do what you do and sweettalk the fact that companies externalize costs (private the profits, socialize the losses). They're sucking your blood.
- Aerroon 1 month ago
  
  One second is long enough that it can put a user off from using your app though. Take notifications on phones for example. I know several people who would benefit from a habitual use of phone notifications, but they never stick to using them because the process of opening (or switching over to) the notification app and navigating its UI to leave a notification takes too long. Instead they write a physical sticky note, because it has a faster "startup time".
  
  2 replies →
- WhyNotHugo 1 month ago
  
  > I have never been convinced by this argument. The aggregate number sounds fantastic but I don't believe that any meaningful work can be done by each user saving 1 second. That 1 second (and more) can simply be taken by me trying to stretch my body out.
  I’d see this differently from a user perspective. If the average operations takes one second less, I’d spend a lot of time less waiting for my computer. I’d also have less idle moments where my mind wanders while waiting for some operation to complete too.
- schubidubiduba 1 month ago
  
  Just because one individual second is small, it still adds up.
  Even if all you do with it is just stretching, there's a chance it will prevent you pulling a muscle. Or lower your stress and prevent a stroke. Or any number of other beneficial outcomes.
gritzko 1 month ago
Let’s make a thought experiment. Suppose that I have a data format and a store that resolves the issues in the post. It is like git meets JSON meets key-value. https://github.com/gritzko/go-rdx
What is the probability of it being used? About 0%, right? Because git is proven and GitHub is free. Engineering aspects are less important.
- pdimitar 1 month ago
  
  I am very interested by something like this but your README is not making it easy to like. Demonstrating with 2-3 sample apps using RDX might have gone a long way.
  So how do I start using it if I, for example, want to use it like a decentralized `syncthing`? Can I? If not, what can I use it for?
  I am not a mathematician. Most people landing on your repo are not mathematicians either.
  We the techies _hate_ marketing with a passion but I as another programmer find myself intrigued by your idea... with zero idea how to even use it and apply it.
- stkdump 1 month ago
  
  Sorry, I am turned off by the CRDT in there. It immediately smells of overengineering to me. Not that I believe git is a better database. But why not just SQL?
  
  4 replies →
vlovich123 1 month ago
I think it’s naive to think engineers or managers don’t realize this or don’t think in these ways.
https://www.folklore.org/Saving_Lives.html
- pdimitar 1 month ago
  
  Is it truly naive if most engineer's careers pass and they never meet even one such manager?
  For 24 years of career I've met the grand total of _two_ such. Both got fired not even 6 months after I got in the company, too.
  Who's naive here?
  
  3 replies →
loloquwowndueo 1 month ago
Just a reminder that GitHub is not git.
The article mentions that most of these projects did use GitHub as a central repo out of convenience so there’s that but they could also have used self-hosted repos.
- machinationu 1 month ago
  
  Explain to me how you self-host a git repo which is accessed millions of time a day from CI jobs pulling packages.
  
  16 replies →
- justincormack 1 month ago
  
  They probably would have experienced issues way sooner, as the self hosted tools don't scale nearly as well.
imiric 1 month ago

> GitHub is free after all, and it has all of these great properties, so why not?
The answer is in TFA:
> The underlying issue is that git inherits filesystem limitations, and filesystems make terrible databases.
JohnHaugeland 1 month ago

> This seems like a tragedy of the commons -- GitHub is free after all, and it has all of these great properties, so why not?
because it's bad at this job, and sqlite is also free
this isn't about "externalities"
gverrilla 1 month ago

Nothing surprising. Capital hates people, even though we sustain his kingdom.
brightball 1 month ago

User time is typically a mix of performance tuning and UX design isn’t it?
threatofrain 1 month ago

> Most software houses spend so much time focusing on how expensive engineering time is that they neglect user time. Software houses optimize for feature delivery and not user interaction time.
Oh no no no. Consumer-facing companies will burn 30% of your internal team complexity budget on shipping the first "frame" of your app/website. Many people treat Next as synonymous with React, and Next's big deal was helping you do just this.
machinationu 1 month ago
[flagged]
- benchloftbrunch 1 month ago
  
  As long as you don't have any security compliance requirements and/or can afford the cost of self hosting your LLM, sure.
  Anyone working in government, banking, or healthcare is still out of luck since the likes of Claude and GPT are (should be) off limits.
- camgunz 1 month ago
  
  I've never been more convinced LLMs are the vanguard of the grift economy now that green accounts are low effort astroturfing on HN.
  
  5 replies →
massysett 1 month ago
> Externalities lead to users downloading extra gigabytes of data (wasted time) and waiting for software, all of which is waste that the developer isn't responsible for and doesn't care about.
This is perfectly sensible behavior when the developers are working for free, or when the developers are working on a project that earns their employer no revenue. This is the case for several of the projects at issue here: Nix, Homebrew, Cargo. It makes perfect sense to waste the user's time, as the user pays with nothing else, or to waste Github's bandwidth, since it's willing to give bandwidth away for free.
Where users pay for software with money, they may be more picky and not purchase software that indiscriminately wastes their time.
- BobbyTables2 1 month ago
  
  Microsoft would have long gone out of business if users cared about their time being wasted.
  Windows 11 should not be more sluggish than Windows 7.

cesarb 1 month ago

One of these is not like the others...

> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies.

This article is mixing two separate issues. One is using git as the master database storing the index of packages and their versions. The other is fetching the code of each package through git. They are orthogonal; you can have a package index using git but the packages being zip/tar/etc archives, you can have a package index not using git but each package is cloned from a git repository, you can have both the index and the packages being git repositories, you can have neither using git, you can even not have a package index at all (AFAIK that's the case for Go).

kpcyrd 1 month ago
The author seems a little lost tbh, it's starting with "your users should not all clone your database" which I definitely agree with, but that doesn't mean you can't encode your data in a git graph.
It then digresses into implementation details of Github's backend implementation (how is 20k forks relevant?), then complains about default settings of the "standard" git implementation. You don't need to checkout a git working tree to have efficient key value lookups. Without a git working tree you don't need to worry about filesystem directory limits, case sensitivity and path length limits.
I was surprised the author believes the git-equivalent of a database migration is a git history rewrite.
What do you want me to do, invent my own database? Run postgres on a $5 VPS and have everybody accept it as single-point-of-failure?
- Spivak 1 month ago
  
  > Run postgres on a $5 VPS and have everybody accept it as single-point-of-failure
  Oh how times have changed. Yes, maybe run two $5 VPSs behind a load balancer for HA so you can patch and then put a CDN in front of it to serve the repository content globally to everyone. Sign the packages cryptographically so you can invite people in your community to become mirrors.
  How do people think PyPI, RubyGems, CPAN, Maven Central, or distro Packages work?
  
  3 replies →
bobpaw 1 month ago

I think the article takes issue not with fetching the code, but with fetching the go.mod file that contains index and dependency information. That’s why part of the solution was to host go.mod files separately.
jayd16 1 month ago
Even with git, it should be possible to grab the single file needed without the rest of the repo, but i'ts still trying to round a square peg.
- skywhopper 1 month ago
  
  Honestly I think the article is a bit ahistorical on this one. ‘go get’ pulls the source code into a local cache so it can build it, not just to fetch the go.mod file. If they were having slow CI builds because they didn’t or couldn’t maintain a filesystem cache, that’s annoying, but not really a fault in the design. Anyway, Go improved the design and added an easy way to do faster, local proxies. Not sure what the critique is here. The Go community hit a pain point and the Go team created an elegant solution for it.
  
  1 reply →

dboon 1 month ago

I’m building Cargo/UV for C. Good article. I thought about this problem very deeply.

Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lense.

Fundamentally, the issue is the sparse checkouts mentioned by the author. You’d really like to use git to version package manifests, so that anyone with any package version can get the EXACT package they built with.

But this doesn’t work, because you need arbitrary commits. You either need a full checkout, or you need to somehow track the commit a package version is in without knowing what hash git will generate before you do it. You have to push the package update and then push a second commit recording that. Obviously infeasible, obviously a nightmare.

Conan’s solution is I think just about the only way. It trades the perfect reproduction for conditional logic in the manifest. Instead of 3.12 pointing to a commit, every 3.x points to the same manifest, and there’s just a little logic to set that specific config field added in 3.12. If the logic gets too much, they let you map version ranges to manifests for a package. So if 3.13 rewrites the entire manifest, just remap it.

I have not found another package manager that uses git as a backend that isn’t a terrible and slow tool. Conan may not be as rigorous as Nix because of this decision but it is quite pragmatic and useful. The real solution is to use a database, of course, but unless someone wants to wire me ten thousand dollars plus server costs in perpetuity, what’s a guy supposed to do?

dkarl 1 month ago
Think about the article from a different perspective: several of the most successful and widely used package managers of all time started out using Git, and they successfully transitioned to a more efficient solution when they needed to.
- zephen 1 month ago
  
  Not only this, but (if I understand the article correctly) at least some of them still use git on the backend.
baobun 1 month ago
How about the Arch Linux AUR approach?
Every package has its own git repository which for binary packages contains mostly only the manifest. Sources and assets, if in git, are usually in separate repos.
This seems to not have the issues in the examples given so far, which come from using "monorepos" or colocating. It also avoids the "nightmare" you mention since any references would be in separate repos.
The problematic examples either have their assets and manifests colocated, or use a monorepo approach (colocating manifests and the global index).
- dboon 1 month ago
  
  The problem is that Arch doesn't need to quickly resolve (version -> manifest) for arbitrary versions. With Arch, /var/lib/pacman/sync/core.db has one release of a set of packages. When you install, you just grab whatever's there. Rolling release. pacman -Syu pulls the newest version of that set of packages. If you install sqlite 3.0 and then come back a few years later and "reinstall" all the Arch packages you used to have on a new machine, you'll either (a) use that exact database and pull the same version or (b) pacman -Syu, pull latest package database, and get the newest sqlite (say, 3.5)
  There's no concept of installing sqlite 3.0 on a system where sqlite 3.5 is available.
  For a language package manager, it's exactly the opposite. I could make a project with every version of sqlite the package manager has ever known about. They all must be resolvable.
  If you want to do that resolution quickly (which manifest do I use for sqlite 3.0?), repo-per-package doesn't work without a bunch of machinery that makes it, IMO, not worth it.
  Pacman is the best, you'd have to pry Arch from my cold, dead hands. Just different constraints.
jopsen 1 month ago

The alluring thing is storing the repository on S3 (or similar). Recall early docker registries making requests so complicated that backing image storage with S3 was unfeasible, without a proxy service.
The thing that scales is dumb HTTP that can be backed by something like S3.
You don't have to use a cloud, just go with a big single server. And if you become popular, find a sponsor and move to cloud.
If money and sponsor independence is a huge concern the alternative would be: peer-to-peer.
I haven't seen many package managers do it, but it feels like a huge missed opportunity. You don't need that many volunteers to peer inorder to have a lot of bandwidth available.
Granted, the real problem that'll drive up hosting cost is CI. Or rather careless CI without caching. Unless you require a user login, or limit downloads for IPs without a login, caching is hard to enforce.
For popular package repositories you'll likely see extremely degenerate CI systems eating bandwidth as if it was free.
Disclaimer: opinions are my own.
adrianN 1 month ago
Before you managed to build a popular tool it is unlikely that you need to serve many users. Directly going for something that can serve the world is probably premature
- dboon 1 month ago
  
  For most software, yes. But the value of a package manager is in its adoption. A package manager that doesn’t run up against these problems is probably a failure anyway.
- EPWN3D 1 month ago
  
  The point is not "design to serve the world". The point is "use the right technology for your problem space".
mook 1 month ago
Is there a reason the users must see all of the historic data too? Why not just have a post-commit hook render the current HEAD to static files, into something like GitHub Pages?
That can be moved elsewhere / mirrored later if needed, of course. And the underlying data is still in git, just not actively used for the API calls.
It might also be interesting to look at what Linux distros do, like Debian (salsa), Fedora (Pagure), and openSUSE (OBS). They're good for this because their historic model is free mirrors hosted by unpaid people, so they don't have the compute resources.
- jarofgreen 1 month ago
  
  I'm not OP but I'll guess .... lock files with old versions of libs in. The latest version of a library may be v2 but if most users are locked to v1.267.34 you need all the old versions too.
  However a lot of the "data in git repositories" projects I see don't have any such need, and then ...
  > Why not just have a post-commit hook render the current HEAD to static files, into something like GitHub Pages?
  ... is a good plan. Usually they make a nice static website with the data that's easy for humans to read though.
ambicapter 1 month ago
> Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lense.
So you need a decentralized database? Those exist (or you can make your own, if you're feeling ambitious), probably ones that scale in different ways than git does.
- dboon 1 month ago
  
  Please share. I’m interested in anything that’s roughly as simple as implementing a centralized registry, is easily inspected by users (preferably with no external tooling), and is very fast.
  It’s really important that someone is able to search for the manifest one of their dependencies uses for when stuff doesn’t work out of the box. That should be as simple as possible.
  I’m all ears, though! Would love to find something as simple and good as a git registry but decentralized
  
  6 replies →
krautsauer 1 month ago

I wonder how meson wraps' story fits with this. They used not to, but now they're throwing everything into a single repository [0]. I wonder about the motivation and how it compares to your project.
0: https://github.com/mesonbuild/wrapdb/tree/master/subprojects
dpedu 1 month ago
> I’m building Cargo/UV for C.
Interesting! Do you mind sharing a link to the project at this point?
- dboon 1 month ago
  
  Sure! It's very raw, though. There's a lot of functionality, and I use it to build all sorts of projects already. But a common thing I do is to write the stupidest possible version of a thing and only do the hard engineering when it becomes untenable. Hence it's not raw as in being new or bare, but it's very raw in that you'll see some really rough stuff in the code.
  But, that being said, here's the repo! I added a very basic README for you. It's one command to bootstrap to a self hosting build, so give it a shot if you're interested. My contact is in my profile.
  https://github.com/tspader/spn

ekjhgkejhgk 1 month ago

Do the easy thing while it works, and when it stops working, fix the problem.

Julia does the same thing, and from the Rust numbers on the article, Julia has about 1/7th the number of packages that Rust does[1] (95k/13k = 7.3).

It works fine, Julia has some heuristics to not re-download it too often.

But more importantly, there's a simple path to improve. The top Registry.toml [1] has a path to each package, and once donwloading everything proves unsustainable you can just download that one file and use it to download the rest as needed. I don't think this is a difficult problem.

[1] https://github.com/JuliaRegistries/General/blob/master/Regis...

galenlynch 1 month ago

I believe Julia only uses the Git registry as an authoritative ledger where new packages are registered [1]. My understanding is that as you mention, most clients don't access it, and instead use the "Pkg Protocol" [2] which does not use Git.
[1] https://github.com/JuliaRegistries/General
[2] https://pkgdocs.julialang.org/dev/protocol/
mi_lk 1 month ago
> Do the easy thing while it works, and when it stops working, fix the problem
Another way to phrase this mindset is "fuck around and find out" in gen-Z speak. It's usually practical to an extent but I'm personally not a fan
- sagarm 1 month ago
  
  I've mostly heard FAFO used to describe something obviously stupid.
  Building on the same thing people use for code doesn't seem stupid to me, at least initially. You might have to migrate later if you're successful enough, but that's not a sign of bad engineering. It's just building for where you are, not where you expect to be in some distant future
- zephen 1 month ago
  
  Not at all.
  When you fuck around optimizing prematurely, you find out that you're too late and nobody cares.
  Oh, well, optimization is always fun, so there's that.
  
  1 reply →
0xbadcafebee 1 month ago
This is basically unethical. Imagine anything important in the world that worked this way. "Do nuclear engineering the easy way while it works, and when it stops working, fix the problem."
Software engineers always make the excuse that what they're making now is unimportant, so who cares? But then everything gets built on top of that unimportant thing, and one day the world crashes down. Worse, "fixing the problem" becomes near impossible, because now everything depends on it.
But really the reason not to do it, is there's no need to. There are plenty of other solutions than using Git that work as well or better without all the pitfalls. The lazy engineer picks bad solutions not because it's necessarily easier than the alternatives, but because it's the path of least resistance for themselves.
Not only is this not better, it's often actively worse. But this is excused by the same culture that gave us "move fast and break things". All you have to do is use any modern software to see how that worked out. Slow bug-riddled garbage that we're all now addicted to.
- xboxnolifes 1 month ago
  
  Most of the world does work this way. Problems are solved within certain conditions and for use over a certain time frame. Once those change, the problem gets revisited.
  Most software gets to take it to more of an extreme then many engineering fields since there isn't physical danger. Its telling that the counter examples always use the potentially dangerous problems like medicine or nuclear engineering. The software in those fields are more stringent.
  
  1 reply →
- ModernMech 1 month ago
  
  Hold up... "lazy engineers" are the problem here? What about a society that insists on shoving the work product of unfunded, volunteer engineers into critical infrastructure because they don't want to pay what it costs to do things the right way? Imagine building a nuclear power plant with an army of volunteer nuclear engineers.
  It cannot be the case that software engineers are labelled lazy for not building the at-scale solution to start with, but at the same time everyone wants to use their work, and there are next to no resources for said engineer to actually build the at scale solution.
  > the path of least resistance for themselves.
  Yeah because they're investing their own personal time and money, so of course they're going to take the path that is of least resistance for them. If society feels that's "unethical", maybe pony up the cash because you all still want to rely on their work product they are giving out for free.
  
  1 reply →
- hombre_fatal 1 month ago
  
  On the other hand, GitHub wants to be the place you choose to build your registry for a new project, and they are clearly on board with the idea given that they help massive projects like Nix packages instead of kicking them off.
  As opposed to something like using a flock of free blogger.com blogs to host media for an offsite project.
  
  1 reply →
- ekjhgkejhgk 1 month ago
  
  Fixing problems as they appear is unethical? Ok then.
  You realize, there are people who think differently? Some people would argue that if you keep working on problems you don't have but might have, you end up never finishing anything.
  It's a matter of striking a balance, and I think you're way on one end of the spectrum. The vast majority of people using Julia aren't building nuclear plants.
  
  2 replies →
- soraminazuki 1 month ago
  
  What is wrong with you? You berated and name-called open source volunteers because a blog post taught you that package managers using Git are "bad." Let me be clear: a 3 minute read of a blog post offers neither moral superiority nor technical insights that surpass those of actual maintainers.
  Contrary to the snap conclusion you drew from the article, there are design trade-offs involved when it comes to package managers using Git. The article's favored solution advocates for databases, which in practice, makes the package repository a centralized black box that compromises package reproducibility. It may solve some problems, but still sucks harder in some ways.
  The article is also flat-out wrong regarding Nixpkgs. The primary distribution method for Nixpkgs has always been tarballs, not Git. Although the article has attempted to backpedal [1], it hasn't entirely done so. It's now effectively criticizing collaboration over Git while vaguely suggesting that maybe it’s a GitHub problem. And you think what, that collaboration over Git is "unethical"???
  On one side, there are open-source maintainers contributing their time and effort as volunteers. On the other, there are people like you attacking them, labeling them "lazy" and bemoaning that you're "forced" to rely on the results of their free labor, which you deride as "slow, bug-riddled garbage" without any real understanding. I know whose side I'm on.
  [1]: https://github.com/andrew/nesbitt.io/commit/8e1c21d96f4e7b3c...
  
  1 reply →
IshKebab 1 month ago

> when it stops working, fix the problem
This is too naive. Fixing the problem costs a different amount depending on when you do it. The later you leave it the more expensive it becomes. Very often to the point where it is prohibitively expensive and you just put up with it being a bit broken.
This article even has an example of that - see the vcpkg entry.
zahlman 1 month ago
> 00000000-1111-2222-3333-444444444444 = { name = "REPLTreeViews", path = "R/REPLTreeViews" }
... Should it be concerning that someone was apparently able to engineer an ID like that?
- ekjhgkejhgk 1 month ago
  
  Could you please articulate specifically why that should be concerning?
  Right now I don't see the problem because the only criterion for IDs is that they are unique.
  
  2 replies →
- skycrafter0 1 month ago
  
  If you read the repo README, it just says "generate a uuid". You can use whatever you want as long as it fits the format, it seems.
- adestefan 1 month ago
  
  It’s as random as any other UUID.
  
  3 replies →

steeleduncan 1 month ago

The other conclusion to draw is "Git is a fantastic choice of database for starting your package manager, almost all popular package managers began that way."

saidinesh5 1 month ago
I think the conclusion is more that package definitions can still be maintained on git/GitHub but the package manager clients should probably rely on a cache/db/a more efficient intermediate layer.
Mostly to avoid downloading the whole repo/resolve deltas from the history for the few packages most applications tend to depend on. Especially in today's CI/CD World.
- reactordev 1 month ago
  
  This is exactly the right approach. I did this for my package manager.
  It relies on a git repo branch for stable. There are yaml definitions of the packages including urls to their repo, dependencies, etc. Preflight scripts. Post install checks. And the big one, the signatures for verification. No binaries, rpms, debs, ar, or zip files.
  What’s actually installed lives in a small SQLite database and searching for software does a vector search on each packages yaml description.
  Semver included.
  This was inspired by brew/portage/dpkg for my hobby os.
  
  1 reply →
- pseufaux 1 month ago
  
  This is how WinGet works. It has a small SQLite db it downloads from a hosted url. The DB contains some minimal metadata and a url path to access the full metadata. This way WinGet only has to make API calls for packages it's actually interacting with. As a package manager, it has plenty of problems still, but it's a simple, elegant solution for the git as a DB issue.
edolstra 1 month ago
Indeed. Nixpkgs wouldn't have been as successful if it hadn't been using Git (or GitHub).
Sure, eventually you run into scaling issues, but that's a first world problem.
- l9o 1 month ago
  
  I actually find that nixpkgs being a monorepo makes it even better. The code is surprisingly easy to navigate and learn if you've worked in large codebases before. The scaling issues are good problems to have, and git has gotten significantly better at handling large repos than it was a decade ago, when Facebook opted for Mercurial because git couldn't scale to their needs. If anything, it's GitHub issues and PRs that are probably showing its cracks.
bluGill 1 month ago
Git isn't a fantastic choice unless you know nothing about databases. A search would show plenty of research on databases and what works when/why.
- kibwen 1 month ago
  
  For the purposes of the article, git isn't just being used as a database, it's being used as a protocol to replicate the database to the client to allow for offline operation and then keep those distributed copies in sync. And even for that purpose you can do better than git if you know what you're doing, but knowledge of databases alone isn't going to help you (let alone make your engineering more economical than relying on free git hosting).
  
  1 reply →
adastra22 1 month ago
Git is an absolute shit database for a package manager even in the beginning. It’s just that GitHub subsidizes hosting and that is hard to pass up.
- IshKebab 1 month ago
  
  What's a better option? One that keeps track of history and has a nice review interface?
- fn-mote 1 month ago
  
  Sure, but can you back up the expletive with some reason why you think that?
  As it is, this comment is just letting out your emotion, not engaging in dialogue.
  
  2 replies →
venturecruelty 1 month ago

No. No, no, no. Git is a fantastic choice if you want a supply chain nightmare and then Leftpad every week forever.

kibwen 1 month ago

I think there's a form of survivorship bias at work here. To use the example of Cargo, if Rust had never caught on, and thereby gotten popular enough to inflate the git-based index beyond reason, then it would never have been a problem to use git as the backing protocol for the index. Likewise, we can imagine innumerable smaller projects that successfully use git as a distributed delta-updating data distribution protocol, and never happen to outgrow it.

The point being, if you're not sure whether your project will ever need to scale, then it may not make sense to reinvent the wheel when git is right there (and then invent the solution for hosting that git repo, when Github is right there), letting you spend time instead on other, more immediate problems.

stickfigure 1 month ago

Right, this post may encourage premature optimization. Cargo, Homebrew, et al chose an easy, good-enough solution which allowed them to grow until they hit scaling limits. This is a good problem to have.
I am sure there's value having a vision for what your scaling path might be in the future, so this discussion is a good one. But it doesn't automatically mean that git is a bad place to start.
inferiorhuman 1 month ago
Keep in mind that crates.io, the main crate registry, uses GitHub as its only authentication method. They may have moved away from git but they're still locked into a rather piss poor vendor.
- kibwen 1 month ago
  
  No, crates.io isn't locked to Github. crates.io uses Github as an identity provider, but there's nothing stopping them from adding more. Furthermore, they've avoided tying themselves to Github in other ways, for example, by resisting all the people just telling them to allow using Github usernames as package namespaces, specifically to prevent them from being locked to Github.
8note 1 month ago
im surprised nobody has made a common db for package managers, so cargo could use it without having to think about it
- kibwen 1 month ago
  
  I mean, it's sort of the other way around. Cargo was built to be able to natively understand git-based dependencies, in the sense that you can bypass a crate registry and instead just point it directly at a git repo. That means that Cargo already had to have the ability to clone git repos, and so when it came to decide how to implement the index (which looks pretty similar to a git repo if you squint), choosing to use git required them to add literally no new dependencies and almost no new code.
  Let's also keep in mind that the use case mentioned in the OP is specifically about the index, which is just the datastructure that informs the version resolver how to resolve versions. When it came time to replace the git-based index, Cargo didn't replace it with a specialized database, it replaced it with HTTP endpoints (which are probably just backed by an off-the-shelf database). It's not clear what sort of specialized database would be useful to abstract this for other package managers.

jama211 1 month ago

“It never works out” - hmm, seems like it worked out just fine, worked great to get the operation of the ground and when scale became an issue it was solvable by moving to something else. It served its purpose, sounds like it worked out to me.

swiftcoder 1 month ago
You appear to have glossed over the two projects in the list that are stuck due to architectural decisions, and don't have any route to migrate off of git-as-database?
- baobun 1 month ago
  
  The issues with nixpkgs stem from that it is a monorepo for all packages and doubling as an index.
  The issues are only fundamental with that architecture. Using a separate repo for each package, like the Arch User Repos, does not have the same problems.
  Nixpkgs certainly could be architected like that and submodules would be a graceful migration path. I'm not aware of discussion of this but guess that what's preventing it might be that github.com tooling makes it very painful to manage thousands of repos for a single project.
  So I think it can be a lesson not to that using git as a database is bad but that using github.com as a database is. PRs as database transactions is clunky and GitHub Actions isn't really ACID.
  
  2 replies →
- hombre_fatal 1 month ago
  
  Be more specific because I just see a list of workarounds deployed once they had the scale to warrant them, supporting the OP’s claim.
  
  2 replies →
- jama211 1 month ago
  
  It’s a fair criticism, and this article does serve well as a warning for people to try and avoid this issue from the start.
efitz 1 month ago
When you start out with a store like git, with file system semantics and a client that has to be smart to handle all the compare and merge operations, then it’s practically impossible to migrate a large client base to a new protocol. Takes years lots of user complaints to and random breakage.
Much better to start with an API. Then you can have the server abstract the store and the operations - use git or whatever - but you can change the store later without disrupting your clients.
- jama211 1 month ago
  
  That costs hosting money no? That might be a bigger problem for someone starting than scalability
leoh 1 month ago
I couldn't agree more strongly. There is a huge opportunity to make git more effective for this kind of use-case, not to abandon it. The essay in question provides no compelling alternative; it therefore reaches an entirely half-baked conclusion.
- jama211 1 month ago
  
  A good point!
lijok 1 month ago
Nooo you don’t get it - it didn’t scale from 0 to a trillion users so it’s a garbage worthless system that “doesn’t scale”.
- zephen 1 month ago
  
  ^^^ Poe's Law may or may not apply to the above comment.

quaintdev 1 month ago

I host my own code repository using Forgejo. It's not public. In fact, it's behind mutual tls like all the service I host. Reason? I don't want to deal with bots and other security risks that come with opening port to the world.

Turns out Go module will not accept package hosted on my Forgejo instance because it asks for certificate. There are ways to make go get use ssh but even with that approach the repository needs to be accessible over https. In the end, I cloned the repository and used it in my project using replace directive. It's really annoying.

agwa 1 month ago

If you add .git to the end of your module path and set $GOPRIVATE to the hostname of your Forgejo instance, then Go will not make any HTTPS requests itself and instead delegate to the git command, which can be configured to authenticate with client certificates. See https://go.dev/ref/mod#vcs-find
xyzzy_plugh 1 month ago

> There are ways to make go get use ssh but even with that approach the repository needs to be accessible over https.
No, that's false. You don't need anything to be accessible over HTTP.
But even if it did, and you had to use mTLS, there's a whole bunch of ways to solve this. How do you solve this for any other software that doesn't present client certs? You use a local proxy.
baobun 1 month ago

If you add the instance TLS cert (CA) to your trust store then go will happily download over https. It can be finicky depending on how you run go but I can confirm it works.
irusensei 1 month ago

Have a look at Tailscale DNS and certs. Its gives you a valid cert through lets encrypt without exposing your services to the internet.

newswangerd 1 month ago

It’s always humbling when you go on the front page of HN and see an article titled “the thing you’re doing right now is a bad idea and here’s why”

This has happened to me a few times now. The last one was a fantastic article about how PG Notify locks the whole database.

In this particular case it just doesn’t make a ton of sense to change course. Im a solo dev building a thing that may never take off, so using git for plug-in distribution is just a no brainer right now. That said, I’ll hold on to this article in case I’m lucky enough to be in a position where scale becomes an issue for me.

baobun 1 month ago
The good news is you can easier avoid some of the pitfalls now even as you stick with it. Some good points in comments.
I don't know if you rely on github.com but IMO vendor lock-in there might be a bigger issue which you can avoid.
- newswangerd 1 month ago
  
  Yeah, I'm implementing a couple of things to make my life easier in the future. I don't use any github APIs and I'm setting up my clients to load the plugin repo URLs from my server so I can change them later if I need to. I want all of the resources my clients need to come from my domain name so I can move it around if I need to.

jarofgreen 1 month ago

It's not just package manager who do this - a lot of smaller projects crowd source data in git repositories. Most of these don't reach the scale where the technical limitations become a problem.

Personally my view is that the main problem when they do this is that it gets much harder for non-technical people to contribute. At least that doesn't apply to package managers, where it's all technical people contributing.

There are a few other small problems - but it's interesting to see that so many other projects do this.

I ended up working on an open source software library to help in these cases: https://www.datatig.com/

Here's a write up of an introduction talk about it: https://www.datatig.com/2024/12/24/talk.html I'll add the scale point to future versions of this talk with a link to this post.

Hasnep 1 month ago

Oh, this would have been great for a project I was working on a while ago! I'll have to keep it in mind for the future. Thanks for sharing

hogrug 1 month ago

The facts are interesting but the conclusion a bit strange. These package managers have succeeded because git is better for the low trust model and GitHub has been hosting infra for free that no one in their right mind would provide for the average DB.

If it didn't work we would not have these massive ecosystems upsetting GitHub's freemium model, but anything at scale is naturally going to have consequences and features that aren't so compatible with the use case.

ifh-hn 1 month ago

So what's the answer then? That's the question I wanted answered after reading this article. With no experience with git or package management, would using a local client sqlite database and something similar on the server do?

encom 1 month ago
I quite like Gentoo's rsync based package manager. I believe they've used that since the beginning. It works well.
- MarsIronPI 1 month ago
  
  To be clear though, the rsync trees come from a central Git repo (though it's not hosted on GitHub). And syncing from Git actually makes syncing faster.
AaronFriel 1 month ago

OCI artifacts, using the same protocol as container registries. It's a protocol designed for versioning (tagging) content addressable blobs, associating metadata with them, and it's CDN friendly.
Homebrew uses OCI as its backend now, and I think every package manager should. It has the right primitives you expect from a registry to scale.

dleslie 1 month ago

GitHub is intoxicatingly free hosting, but Git itself is a terrible database. Why not maintain an _actual_ database on GitHub, with tagged releases?

Sqlite data is paged and so you can get away with only fetching the pages you need to resolve your query.

https://phiresky.github.io/blog/2021/hosting-sqlite-database...

jarofgreen 1 month ago
This seems to be about hosting an Sqlite database on a static website like GitHub Pages - this can be a great plan, there is also Datasette in a browser now: https://github.com/simonw/datasette-lite
But that's different from how you collect the data in a git repository in the first place - or are you suggesting just putting a Sqlite file in a git repository? If so I can think of one big reason against that.
- dleslie 1 month ago
  
  Yes, I'm suggesting hosting it on GitHub, leveraging their git lfs support. Just treat it like a binary blob and periodically update with a tagged release.
  
  2 replies →

cbondurant 1 month ago

Admittedly, I try and stay away from database design whenever possible at work. (Everything database is legacy for us) But the way the term is being used here kinda makes me wonder, do modern sql databases have enough security features and permissions management systems in place that you could just directly expose your database to the world with a "guest" user that can only make incredibly specific queries?

Cut out the middle man, directly serve the query response to the package manager client.

(I do immediately see issues stemming from the fact that you cant leverage features like edge caching this way, but I'm not really asking if its a good solution, im more asking if its possible at all)

bob1029 1 month ago
There are still no realistic ways to expose a hosted SQL solution to the public without really unhappy things occurring. It doesn't matter which vendor you pick.
Anything where you are opening a TCP connection to a hosted SQL server is a non-starter. You could hypothetically have so many read replicas that no one could blow anyone else up, but this would get to be very expensive at scale.
Something involving SQLite is probably the most viable option.
- IshKebab 1 month ago
  
  Feels like there's an opening in the market there. Why can't you expose an SQL server to the public?
  Also Stackoverflow exposes a SQL interface so it isn't totally impossible.

zX41ZdbW 1 month ago

ClickHouse can do it. Examples:

    https://play.clickhouse.com/

    clickhouse-client --host play.clickhouse.com --user play --secure

    ssh play.clickhouse.com

baobun 1 month ago

Yes but CH is not SQL.

1 reply →

brendoncarroll 1 month ago

I personally think that this is the future, especially since such an architecture allows for E2E encryption of the entire database. The protocol should just be a transaction layer for coordinating changes of opaque blobs.
All of the complexity lives on the client. That makes a lot of sense for a package manager because it's something lots of people want to run, but no one really wants to host.
yawaramin 1 month ago

There's no need to have a publicly accessible database server, just put all the data in a single SQLite database and distribute that to clients. It's possible to do streaming updates by just zipping up a text file containing all the SQL commands and letting clients download that. Or even a more sophisticated option is eg Litestream.
mirekrusin 1 month ago

You can use fossil [0]
[0] https://fossil-scm.org

account42 1 month ago

> Windows restricts paths to 260 characters, a constraint dating back to DOS.

It doesn't if you do it properly.

Ericson2314 1 month ago

The Nixpkgs example is not like the others, because it is source code.

I don't get what is so bad about shallow clones either. Why should they be so performance sensative?

__MatrixMan__ 1 month ago
It also seems like it's not git that's emitting scary creaks and groans, but rather GitHub. As much as it would be a bummer to forgo some of GitHub's nice-to-have features, I expect we could survive without some of it.
- mindslight 1 month ago
  
  Furthermore, the issues given for nixpkgs are actually demonstrating the success of using git as the database! Those 20k forks are all people maintaining their own version of nixpkgs on Github, right? Each their own independent tree that users can just go ahead and modify for their own whims and purposes, without having to overcome the activation energy of creating their own package repository.
  If 83GB (4MB/fork) is "too big" then responsibility for that rests solely on the elective centralization encouraged by Github. I suspect if you could go and total up the cumulative storage used by the nixpkgs source tree distributed on computers spread throughout the world, that is many orders of magnitude larger.
  
  1 reply →
- MarsIronPI 1 month ago
  
  Exactly. Gentoo's main package repo is hosted in Git (but not GitHub, except as a mirror). Now, most users fetch it via rsync, but actually using the Git repo IME makes syncing faster, not slower. Though it does make the initial fetch slower.
kccqzy 1 month ago
Shallow clones themselves aren’t the issue. It’s that updating shallow clones requires the server to spend a bunch of CPU time and GitHub simply isn’t willing to provide that for free.
The solution is simple: using a shallow clone means that the use case doesn’t care about the history at all, so download a tarball of the repo for the initial download and then later rsync the repo. Git can remain the source of truth for all history, but that history doesn’t have to be exposed.
- collinmanderson 1 month ago
  
  Can you rsync a repo from GitHub?
ajb 1 month ago

In a compressed format, later commits would be added as a delta of some kind, to avoid increasing the size by the whole tree size each time. To make shallow clones efficient you'd need to rewrite the compressed form such that earlier commits are instead deltas on later ones, or something equivalent.

twoodfin 1 month ago

What made git special & powerful from the start was its data model: Like the network databases of old, but embedded in a Merkle tree for independent evolution and verifiability.

Scaling that data model beyond projects the size of the Linux kernel was not critical for the original implementation. I do wonder if there are fundamental limits to scaling the model for use cases beyond “source code management for modest-sized, long-lived projects”.

amluto 1 month ago
Most of the problems mentioned in the article are not problems with using a content-addressed tree like git or even with using precisely git’s schema. The problems are with git’s protocol and GitHub’s implementation thereof.
Consider vcpkg. It’s entirely reasonable to download a tree named by its hash to represent a locked package. Git knows how to store exactly this, but git does not know how to transfer it efficiently.
- mananaysiempre 1 month ago
  
  > Git knows how to store [a hash-addressed tree], but git does not know how to transfer it efficiently.
  Naïvely, I’d expect shallow clones to be this, so I was quite surprised by a mention of GitHub asking people not to use them. Perhaps Git tries too hard to make a good packfile?..
  Meanwhile, what Nixpkgs does (and why “release tarballs” were mentioned as a potential culprit in the discussion linked from TFA) is request a gzipped tarball of a particular commit’s files from a GitHub-specific endpoint over HTTP rather than use the Git protocol. So that’s already more or less what you want, except even the tarball is 46 MB at this point :( Either way, I don’t think the current problems with Nixpkgs actually support TFA’s thesis.

Zambyte 1 month ago

The issues with using Git for Nix seem to entirely be issues with using GitHub for Nix, no?

Rucadi 1 month ago

I also got the same feeling from that, in fact, I would go as far as to say that nixpkgs and nix-commands integration with git works quite well and is not an issue.
So the phrase the article says "Package managers keep falling for this. And it keeps not working out" I feel that's untrue.
The most issue I have with this really is "flakes" integration where the whole recipe folder is copied into the store (which doesn't happen with non-flakes commands), but that's a tooling problem not an intrinsic problem of using git
femiagbabiaka 1 month ago

Yeah, it's inclusion in here is baffling because none of the listed issues have anything to do with the particular issue nixpkgs is having.

ori_b 1 month ago

Alternatively: Downloading the entire state of all packages when you care about just one, it never works out.

O(1) beats O(n) as n gets large.

gruez 1 month ago
Seems to still work out for apt?
- ajb 1 month ago
  
  Not in the same sense. An analogy might be: apt is like fetching a git repo in which all the packages are submodules, so lazily fetched. Some of the package managers in the article seem to be using a monorepo for all packages - including the content. Others seem to have different issues - go wasn't including enough information in the top level, so all the submodules had to be fetched anyway. vcpkg was doing something with tree hashes which meant they weren't really addressible.
- collinmanderson 1 month ago
  
  I consider apt kinds slow. I wish it were much faster.

mikkupikku 1 month ago

People who put off learning SQL for later end up using anything other than a database as their database.

redog 1 month ago

SQL killed the set theory star
groundzeros2015 1 month ago
Is sql over ssh a thing?
- yawaramin 1 month ago
  
  https://litestream.io/
  
  2 replies →

bencornia 1 month ago

> Grab’s engineering team went from 18 minutes for go get to 12 seconds after deploying a module proxy. That’s not a typo. Eighteen minutes down to twelve seconds.

> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies. Cloning entire repositories to get a single file.

I have also had inconsistent performance with go get. Never enough to look closely at it. I wonder if I was running into the same issue?

zahlman 1 month ago
> needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies.
Python used to have this problem as well (technically still does, but a large majority of things are available as a wheel and PyPI generally publishes a separate .metadata file for those wheels), but at least it was only a question of downloading and unpacking an archive file, not cloning an entire repo. Sheesh.
Why would Go need to do that, though? Isn't the go.mod file in a specific place relative to the package root in the repo?
- klooney 1 month ago
  
  Go's lock files arrived at around the same time as the proxy, before then you didn't have transitive dependencies pre baked.
fireflash38 1 month ago

How long ago were you having issues? That was changed in go 1.13.

the__alchemist 1 month ago

The Cargo example at the top is striking. Whenever I publish a crate, and it blocks me until I write `--allow-dirty`, I am reminded that there is a conflation between Cargo/crates.io and Git that should not exist. I will write `--allow-dirty` because I think these are two separate functionalities that should not be coupled. Crates.io should not know about or care about my project's Git usage or lack thereof.

cesarb 1 month ago
> The Cargo example at the top is striking. Whenever I publish a crate, and it blocks me until I write `--allow-dirty`, I am reminded that there is a conflation between Cargo/crates.io and Git that should not exist. I will write `--allow-dirty` because I think these are two separate functionalities that should not be coupled.
That's completely unrelated.
The --allow-dirty flag is to bypass a local safety check which prevents you from accidentally publishing a crate with changes which haven't been committed to your local git repository. It has no relation at all to the use of git for the index of packages.
> Crates.io should not know about or care about my project's Git usage or lack thereof.
There are good reasons to know or care. The first one, is to provide a link from the crates.io page to your canonical version control repository. The second one, is to add a file containing the original commit identifier (commit hash in case of git) which was used to generate the package, to simplify auditing that the contents of the package match what's on the version control repository (to help defend against supply chain attacks). Both are optional.
- the__alchemist 1 month ago
  
  Those are great points, and reinforce the concept that there is conflation between Cargo and Git/commits. Commits and Cargo IMO should be separate concepts. Cargo should not be checking my Git history prior to publishing.

frumplestlatz 1 month ago

Since ~2002, Macports has used svn or git, but users, by default, rsync the complete port definitions + a server-generated index + a signature.

The index is used for all lookups; it can also be generated or incrementally updated client-side to accommodate local changes.

This has worked fine for literally decades, starting back when bandwidth and CPU power was far more limited.

The problem isn’t using SCM, and the solutions have been known for a very long time.

ekjhgkejhgk 1 month ago

Uncertain if this is OT, but given that the CCC is politically inspired organization, I hope not:

One thing that still seems absent is awareness of the complete takeover of "gadgets" in schools. Schools these days, as early as primary school, shove screens in front of children. They're expected to look at them, and "use" them for various activities, including practicing handwriting. I wish I was joking [1].

I see two problems with this.

First is that these devices are engineered to be addictive by way of constant notifications/distractions, and learning is something that requires long sustained focus. There's a lot of data showing that under certain common circumstances, you do worse learning from a screen than from paper.

Second is implicitly it trains children to expect that anything has to be done through a screen connected to a closed point-and-click platform. (Uninformed) people will say "people who work with computers make money, so I want my child to have an ipad". But interacting with a closed platform like an ipad is removing the possibilities and putting the interaction "on rails". You don't learn to think, explore and learn from mistakes, instead you learn to use the app that's put in front of you. This in turn reinforces the "computer says no" [2] approach to understanding the world.

I think this is a matter of civil rights and freedom, but sadly I don't often see "civil rights" organizations talk about this. I think I heard Stallman say something along these lines once, but other than that I don't see campaigns anywhere.

[1] https://www.letterjoin.co.uk/

[2] https://youtu.be/eE9vO-DTNZc

AceJohnny2 1 month ago
It looks like you commented on the wrong post, although I don't immediately see a front-page post about the ongoing Chaos Computer Congress.
- kzrdude 1 month ago
  
  it's here https://news.ycombinator.com/item?id=46386211 (and it was last on the front page at the moment)
  
  1 reply →
- ekjhgkejhgk 1 month ago
  
  LOL sorry. You're right. I'll copy paste over there.

teiferer 1 month ago

And this my friends is the reason why (only) focusing on CPU cycles and memory hierarchies is insufficient when thinking of the performance of a system. Yes they are important. But no level of low-level optimization will get you out of the hole that a wrong choice of algorithm and/or data structure may have dug you into.

leoh 1 month ago

The conclusion reached in this essay is 100% wrong. See " The reftable backend What it is, where it's headed, and why should you care?"

>With release 2.45, Git has gained support for the “reftable” backend to read and write references in a Git repository. While this was a significant milestone for Git, it wasn‘t the end of GitLab’s journey to improve scalability in repositories with many references. In this talk you will learn what the reftable backend is, what work we did to improve it even further and why you should care.

https://www.youtube.com/watch?v=0UkonBcLeAo

Also see Scalar, which Microsoft used to scale their 300GiB Windows repository, https://github.com/microsoft/scalar.

mukundesh 1 month ago

Though not Github, worth mentioning Huggingface, which is also using git, but managing large files with their(?) xet protocol. https://huggingface.co/docs/hub/en/xet/index

bandrami 1 month ago

Maybe I'm misreading the article but isn't every example about the downside of using github as a database host, not the downside of using git as a database?

Like, yes, you should host your own database. This doesn't seem like an argument against that database being git.

zzo38computer 1 month ago

Git commits will have a hash and each file will have a hash, which means that locking is unnecessary for read access. (This is also true of fossil, although fossil does have locking since it uses SQLite.)

The other stuff mentioned in the article seems to be valid criticisms.

hk1337 1 month ago

I like Go but it’s dependency management is weird and seems to be centered around GitHub a lot.

andreashaerter 1 month ago
It's mostly tradition rather than a hard requirement. Go has long supported vanity import paths: https://pkg.go.dev/cmd/go#hdr-Remote_import_paths
For example, we use Hugo to provide independent Go package URLs even though the code is hosted on GitHub. That makes migrating away from GitHub trivial if we ever choose to do so (Repo: https://github.com/foundata/hugo-theme-govanity; Example: https://golang.foundata.com/hugo-theme-dev/). Usage works as expected:
go get golang.foundata.com/hugo-theme-dev
Edit: Formatting
Hendrikto 1 month ago

There is nothing tying Go to GitHub.
rewgs 1 month ago

Not at all. It can grab git repos (as well as work with other VCSs). There's just a lot of stuff on GitHub, hence your impression.

themk 1 month ago

I think git is overkill, and probably a database is as well.

I quite like the hackage index, which is an append-only tar file. Incremental updates are trivial using HTTP range requests making hosting it trivial as well.

nacozarina 1 month ago

successful things often have humble origins, it’s a feature not a bug

for every project that managed to out-grow ext4/git there were a hundred that were well-served and never needed to over-invest in something else

kuahyeow 1 month ago

GitLab employee here. We have completed the move away from Gollum years ago (see https://gitlab.com/groups/gitlab-org/-/epics/2381).

It looks like that doc https://docs.gitlab.com/development/wikis/ was outdated - since fixed to no longer mention Gollum.

PunchyHamster 1 month ago

The article conclusion is just... not good. There are many benefits to using Git as backend, you can point your project to every single commit as a version which makes testing any fixes or changes in libs super easy, it has built in integrity control and technically (sadly not in practice) you could just sign commits and use that to verify whether package is authentic.

It being unoptimal bandwidth wise is frankly just a technical hurdle to get over it, with benefits well worth the drawback

aidenn0 1 month ago

As far as I know, Nixpkgs doesn't use git as a package database. The packages definitions are stored and developed in git, but the channels certainly are not.

shellkr 1 month ago

I am not sure this is necessarily a git issue as it is mostly a GitHub issue. just look at the Aur of Arch Linux which works perfectly.

mikepurvis 1 month ago

The nix cli almost exclusively pulls GitHub as zipballs. Not perfect but certainly far faster than a real git clone.

pxc 1 month ago
That it supports fetching via Git as well as various via forge-specific tarballs, even for flakes, is pretty nice. It means that if your org uses Nix, you can fall back to distribution via Git as a solution that doesn't require you to stand up any new infra or tie you to any particular vendor, but once you get rolling it's an easy optimization to switch to downloading snapshots.
The most pain probably just becomes from the hugeness of Nixpkgs, but I remain an advocate for the huge monorepo of build recipes.
- mikepurvis 1 month ago
  
  Yes agreed. It’s possible to imagine some kind of cached-deltas scheme to get faster/smaller updates, but I suspect the folks who would have to build and maintain that are all on gigabit internet connections and don’t feel the complexity is worth it.
  
  2 replies →

drzaiusx11 1 month ago

I'd add git gemfile dependencies to the list of languages called out here as well. It supports git repos, but in general it's a bad idea unless you are diligent with git tag use and disallow git tag mutability, which also assumes you have complete control of your git dependencies...

krbaccord941x 1 month ago

I understand article is concerning RFC2789, in cloning whole indexes for lang indexes, but /cargo/src shallow-clones need another layer, where tertiary compilation or decompression takes place in mutex libraries, whether its SSL certificate is dependent on HTTP fetch.

drzaiusx11 1 month ago

One of the first things I did at my current place of employment was to detangle the mess of gemfile git dependencies and get them to adopt semver and an actual package repo. There were so many footguns with git dependencies in ruby we were getting taken down by friendly fire on the daily...

skywhopper 1 month ago

Not sure I can agree with the takeaway. It works well at first, but doesn’t scale, so folks found workarounds. That’s how literally every working system grows. There are always bottlenecks eventually. And you address them when they become an issue, not five years earlier.

dromologist 1 month ago

We wanted to pull updated code in our undockerized instances when they were instantiated, so we decided to pull the code from GitHub. Worked out pretty well though after a thousand trials we got a 502 and now we're one step closer to being forced into a CD pipeline.

schnatterer 1 month ago

> Even GitOps tools that embrace git as a source of truth have to work around its limitations

I'd say this only applies to huge scale (or monorepos, as mentioned in the article). Another workaround might be gitless gitops via OCI.

pizlonator 1 month ago

What is the alternative?

"Use a database" isn't actionable advice because it's not specific enough

yawaramin 1 month ago

Use an SQLite database file, stream out delta updates to clients using zipped plaintext or Litestream or something.

holyknight 1 month ago

It’s basically the same thing that always happens when you choose a technology because it’s convenient rather than a great fit for your problem. Sooner or later, you’ll hit a wall. Just because you can cook a salmon in your dishwasher doesn’t mean you should.

ferfumarma 1 month ago

I love this write-up. As a non-expert user of package managers I can quickly understand a set of patterns that have been deeply considered and carefully articulated. Thanks for taking the time to write up your observations!

rldjbpin 1 month ago

the title and core argument do not seem to align much. subject is git, but most discourse is around github. the role discussed is for serving packages, while the title refers to it as "database".

regardless of the semantics, git is not ideal for serving files. this has been more apparent in the ai world, where extensions such as git lfs has allowed larger file size.

but as seen elsewhere, network effects trump over any design issues. we can always introduce an "lfs" for better shallow fetching (cached? compressed?) and this would resolve a majority of the op's grievences.

xpressvideoz 1 month ago

The article lists Git-based wiki engines as a bad usage of Git. Can anybody recommend alternatives? I want something that can be self-hosted, is easily modified by text editors, and has individual page history, preferably with Markdown.

rudolph9 1 month ago

It’s worth considering if these package managers would have taken off if they didn’t use git. You get a bunch for free, why not use it while you’re small?

whytevuhuni 1 month ago

No mention of Guix. Has its situation improved? I remember waiting almost an hour on “guix pull” to catch up with its git repo on a fresh install.

cben 1 month ago

Obligatory link to the gold intro to So. Many. Aspects. of pkg manager design: https://medium.com/@sdboyer/so-you-want-to-write-a-package-m... Even if its section on "Central Package Registry" isn't very deep.

gethly 1 month ago

If we stopped using VCS to fetch source files, we would lose the ability to get the exact commit(understand as version that has nothing to do with the underlying VCS) of these files. Git, Mercurial, SVN.., github, bitbucket...it does not matter. Absolutely nobody will be building downloadable versions of their source files, hosted on who knows how "prestigious" domains, by copying them to another location just to serve the --->exact same content<--- that github and alike already provide.

This entire blog is just a waste of time for anyone reading it.

throwway120385 1 month ago
Or you could just ship a tarball and an sha checksum.
- gethly 1 month ago
  
  you could, in case you want to make only certain releases publicly available. but then, who wants to do that manual labour? we're talking mainstream here, not specific use cases.
forrestthewoods 1 month ago

> This entire blog is just a waste of time for anyone reading it.
Well that’s an extremely rude thing to say.
Personally I thought it was really interesting to read about a bunch of different projects all running into the same wall with Git.
I also didn’t realize that Git had issues with sparse checkouts. Or maybe author meant shallow? I forget.
layer8 1 month ago

And yet, that's pretty much how the Java world works (Maven repositories).

dwardu 1 month ago

Worst thing is when you’re in a an office and your pc along with other pcs pulls from git unauthenticated, then you get hit with api limits

didip 1 month ago

So… What we need is a globally distributed git seeders of all open source github content, then?

Seems possible if every git client is also a torrent client.

encom 1 month ago

>[Homebrew] Auto-updates now run every 24 hours instead of every 5 minutes[...]

That is such an insane default, I'm at a loss for words.

croemer 1 month ago

You mean the 5 minutes is insane, right?
justsomehnguy 1 month ago

Every time I things like I really want to punch people over TCP/IP. UDP wouldn't suffice.

mcny 1 month ago

I want to take a quick detour here if anyone is knowledgeable about this topic.

> The hosting problems are symptoms. The underlying issue is that git inherits filesystem limitations, and filesystems make terrible databases.

Does this mean mbox is inherently superior to maildir? I really like the idea of maildir because there is nothing to compact but if we assume we never delete emails (on the local machine anyways), does that mean mbox or similar is preferable over maildir?

juped 1 month ago

No, of course not.

nottorp 1 month ago

> Auto-updates now run every 24 hours instead of every 5 minutes

What the... why would you run an autoupdate every 5 minutes?

wg0 1 month ago

Why not use SQLite then as database for package managers? A local copy could be replicated easily with delta fetch.

miyuru 1 month ago

Funnily enough, I clicked the homebrew GitHub link in the post, only to get a rate limited error page from GitHub.

iamwil 1 month ago

This sounds like a missing piece of software in the OSS world. If you have the inclination, you should write it.

yawaramin 1 month ago

People have: https://0install.net/

keithgroves 1 month ago

When building https:/enact.tools we considered this. I'm glad we didn't go this route.

sghiassy 1 month ago

Use the git clone —shallow option and you’ll only download the most recent commits. Yeesh

ZenoArrow 1 month ago
Did you read the article? It references the server-side overhead for shallow clones.
- sghiassy 1 month ago
  
  The “server’ cost you’re referencing is the CI system running git shallow then brew update on GitHubs CI servers.
  
  1 reply →

skylurk 1 month ago

Sounds like it worked pretty well several times? But yeah it does not scale forever.

rurban 1 month ago

You can add Debian to the list of pain. They'll find out soon enough

weiwenhao 1 month ago

For package management software that is rarely used, free is the biggest motivation.

grumbel 1 month ago

Do we have distributed databases that regular users can clone, modify and merge?

born-jre 1 month ago

lol I see this as I plan on using Git for my thing store. https://github.com/blue-monads/potatoverse

pxc 1 month ago

Loved this article. Just enough detail to make the broad scope compatible with a reasonable length, and well-argued.

I feel sometimes like package management is a relatively second-class topic in computer science (or at least among many working programmers). But a package manager's behavior can be the difference between a grotesque, repulsive experience and a delightful, beautiful one. And there aren't quite yet any package managers that do well everything that we collectively have learned how to do well, which makes it an interesting space imo.

Re: Nixpkgs, interestingly, pre-flakes Nix distributes all of the needed Nix expressions as tarballs, which does play nice with CDNs. It also distributes an index of the tree as a SQLite database to obviate some of the "too many files/directories" problem with enumerating files. (In the meantime, Nixpkgs has also started bucketing package directories by name prefix, too.) So maybe there was a lesson learned here that would be useful to re-learn.

On the other hand, IIRC if you use the GitHub fetcher rather than the Git one, including for fetching flakes, Nix will download tarballs from GitHub instead of doing clones. Regardless, downloading and unpacking Nixpkgs has become kinda slow. :-\

BlueTemplar 1 month ago

Wait, isn't fossil based on sqlite ?

Or does fossil itself still have the same issues ?

juped 1 month ago

These are actually all problems with using Github as an ersatz CDN.

venturecruelty 1 month ago

This is why I don't use programming languages that do that.

0xbadcafebee 1 month ago

YOLO software engineering, the hallmark of the 21st century

gjvc 1 month ago

sqlite seems to be ideal for a package manager

sigwinch 1 month ago

I feel like the rqlite people would have a lot to say about how to coordinate your installations, especially for the high-bandwidth non-desktop installs.
https://news.ycombinator.com/item?id=45257349
mirekrusin 1 month ago

...or scm [0]
[0] https://fossil-scm.org

khc 1 month ago

seems like the issue isn't with using git as a database, but using github as a distribution mechanism?

tarun_anand 1 month ago

why dont we have a P2P transfer platform for this (modulo security)

notorandit 1 month ago

Repsy

stephenlf 1 month ago

Omarchy

aniou 1 month ago

As side note. Maybe someone knows, why rust devs chose an already used name for language changes proposal? "RFC" was already taken and well-established and I simply refuse to accept that someone wasn't aware about Request For Comments - and if it was true and clash was created deliberately, then it was rude and arrogant.

Every, ...king time, when I read something like "RFC 2789 introduced a sparse HTTP protocol." my brain suffers from a short-circuit. BTW: RFC 2789 is a "Mail Monitoring MIB".

adastra22 1 month ago
There are many, many RFC collections. Including many that predate the IETF. Some even predate computers.
- aniou 1 month ago
  
  But they were in different domains. Here, we have a strong clash because Rust is positioning itself as secure system and internet language and computer and internet standard are already defined by RFC-s. So, it may be not uncommon, when someone would tell about Rust mechanisms, defined by particular RFC in context of handling particular protocol, defined by... well... RFC too. But not by rust-one.
  Not so smart, when we realize, that one of aspects of secure and reliable system is elimination of ambiguities.
Conan_Kudo 1 month ago

Ask them, don't ask us. They have a public interface, you can ask them to change the name to something unique.

eviks 1 month ago

Indeed, the seductive nature of bad tools lying close to your hand - no need to lift your butt to get them!

unit149 1 month ago

[dead]