For a few years I've been checking large tarballs (tens to hundreds of MB) into a single git repo that I use to manage a website archive. It can be made to work, but it's very painful.
I think there are three main issues:
1. Since git is a distributed VCS, everyone must have a whole copy of the entire repo. That means anyone cloning the repo or pulling a significant set of commits ends up downloading vast amounts of binaries. If you can directly copy the .git dir to the other machine first, instead of using git's normal cloning mechanism, it's not as bad, but you're still fundamentally copying everything (a partial clone softens this a bit; see the sketch after this list):
$ du -sh .git
55G .git
2. git doesn't "know" that something is a binary (although it seems to in some circumstances), so some common operations try to search binaries or otherwise operate on them as if they were text. (I just ran git log -S on that repo and git ran out of memory and crashed, on a machine with 64 GB of RAM.) Marking the files as binary in .gitattributes helps somewhat; see below.
3. The cure for this (git lfs) is worse than the disease. LFS is so bad/strange that I stopped using it and went back to putting the tarballs in git.
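For what it's worth, two partial mitigations, as a sketch rather than a fix (the patterns and URL below are placeholders, not from the repo above): mark the tarballs as binary in .gitattributes so diff-oriented commands stop treating them as text, and use a partial clone so blobs are only fetched on demand:

$ cat .gitattributes
*.tar.gz binary
*.tgz binary
$ git clone --filter=blob:none https://example.com/archive.git

The binary attribute is a built-in macro for -diff -merge -text, and --filter=blob:none still fetches blobs lazily as you check out commits, so it trims the initial clone rather than the total transfer.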
This is a problem that shows up everywhere from game development to ML datasets. We built Oxen to solve it: https://github.com/Oxen-AI/Oxen (I work at Oxen.ai). It's source control for large data; currently our biggest repository is 17 TB. Would love for you to try it out. It's open source, so you can self-host as well.
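The day-to-day workflow is meant to mirror git. A minimal sketch, assuming the git-style CLI from the README (the file name here is a placeholder):

$ oxen init
$ oxen add big-archive.tar.gz
$ oxen commit -m "add archive tarball"
$ oxen push origin main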
Why would someone check binaries into a repo? The only time I came across binaries checked into a repo was because that particular dev couldn't be bothered to learn NuGet / Maven (and the dev who approved that PR didn't understand it either).
Because it's way easier if you don't require every level designer to spend 5 hours recompiling everything before they can get to work in the morning; because it's way easier to just check in that weird DLL than to provide weird instructions for retrieving it; because onboarding is much simpler if all the tools are in the project; … And it's no sweat off p4's back.
Because it's (part of) a website that hosts the tarballs, and we want to keep the whole site under version control. Not saying it's a good reason, but it is a reason.