Comment by glitchc
1 day ago
No. This is not a solution.
While git LFS is just a kludge for now, writing a filter argument during the clone operation is not the long-term solution either.
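For reference, the kind of invocation being discussed looks roughly like this (the size threshold and repository URL are placeholders, not what the article prescribes):

    # omit historical blobs over the size limit (here ~1 MB) from the download
    git clone --filter=blob:limit=1m https://example.com/big-repo.git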
Git clone is the very first command most people will run when learning how to use git. Emphasized for effect: the very first command.
Will they remember to write the filter? Maybe, if the tutorial for the cool codebase they're trying to access mentions it. Maybe not. What happens if they don't? The clone may take a very long time, with no obvious indication of why. And if they do? The cloned repo might not be compilable/usable, since the blobs are missing.
Say they do get it right. Will they understand it? Most likely not. We are exposing the inner workings of git on the very first command they learn. What's a blob? Why do I need to filter on it? Where are blobs stored? It's classic abstraction leakage.
This is a solved problem: Rsync does it. Just port the bloody implementation over. It does mean supporting alternative representations or moving away from blobs altogether, which git maintainers seem unwilling to do.
I totally agree. This follows a long tradition of Git "fixing" things by adding a flag that 99% of users won't ever discover. They never fix the defaults.
And yes, you can fix defaults without breaking backwards compatibility.
> They never fix the defaults
Not strictly true. They did change the default push behaviour from "matching" to "simple" in Git 2.0.
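For completeness, that default is a single config knob; setting it explicitly looks like this:

    # make a plain "git push" push only the current branch to its upstream
    git config --global push.default simple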
So what was the second time the stopped watch was right?
I agree with GP. The git community is very fond of doing checkbox fixes for team problems, fixes that aren't, or can't be, set as defaults and so require constant user intervention to work. See also some of the sparse checkout systems and adding notes to commits after the fact. They only work if you turn every pull and push into a flurry of activity. Which means they will never work from your IDE. Those are non-fixes that pollute the space for actual fixes.
> The cloned repo might not be compilable/usable since the blobs are missing.
Only the historical versions of the blobs are filtered out; the checked-out revision still has its files.
> This is a solved problem: Rsync does it.
Can you explain what the solution is? I don't mean the details of the rsync algorithm, but rather what it would look like from the users' perspective. What files are on your local filesystem when you do a "git clone"?
When you do a shallow clone, no historical versions would be present. When doing a full clone, however, you get a full copy of each version of each blob, and what is being suggested is to treat each revision as an rsync operation upon the last. And the more times you muck with a file, which happens a lot both with assets and when you check in your deps to get exact snapshotting of code, the more big-file churn that adds up to.
The overwhelming majority of large assets (images, audio, video) will receive near-zero benefit from using the rsync algorithm. The formats generally have massive byte-level differences even after small “tweaks” to a file.
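That's easy to sanity-check for any particular asset: force rsync's delta-transfer algorithm on two local revisions of the file and look at the stats it prints (file names here are made up, and the destination file gets overwritten):

    # --no-whole-file forces the delta algorithm even between local copies;
    # --stats reports how much of the old file could be reused ("Matched data")
    rsync --no-whole-file --stats texture_v2.png assets/texture.png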
Maybe a manual filter isn't the right solution, but this does seem to add a lot of missing pieces.
The first time you try to commit on a new install, git nags you to set your email address and name. I could see something similar happen the first time you clone a repo that hits the default global filter size, with instructions on how to disable it globally.
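That existing nag already points at a two-line fix, which is about the level of hand-holding being suggested here (name and address below are obviously placeholders):

    git config --global user.name "Ada Lovelace"
    git config --global user.email ada@example.com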
> The cloned repo might not be compilable/usable since the blobs are missing.
Maybe I misunderstood the article, but isn't the point of the filter to prevent downloading the full history of big files, and instead only check out the required version (like LFS does)?
So a filter of 1 byte will always give you a working tree, but trying to check out a prior commit will require downloading the blobs that commit needs.
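Something like the following, with a placeholder URL; the omitted blobs are fetched lazily when an older commit actually needs them:

    # omit every non-empty blob from the history download
    git clone --filter=blob:limit=1 https://example.com/big-repo.git
    cd big-repo
    # the working tree at HEAD is complete; this checkout triggers an
    # on-demand fetch of the blobs the older commit needs
    git checkout HEAD~10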
Exactly. If large files suck in git, that's because the git backend and cloning mechanism suck at handling them. Fix that and then let us move on.
Would it be incorrect to say that most of the bloat relates to historical revisions? If so, maybe an rsync-like behavior starting with the most current version of the files would be the best starting point. (Which is all most people will need anyhow.)
> Would it be incorrect to say that most of the bloat relates to historical revisions?
Based on my experience (YMMV), I think it is incorrect, yes, because any time I've performed a shallow clone of a repository, the saving wasn't as much as one would intuitively imagine (in other words: history is stored very efficiently).
Doing a bit of digging seems to confirm that: git removes a lot of redundancy during the garbage collection phase by delta-compressing objects into packfiles. It does, however, store each file as a complete snapshot at the object level (unlike a VCS like Mercurial, which stores deltas), so it still might benefit from a download-the-current-snapshot-first approach.
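One way to put a number on it for a particular repo (the URL is a placeholder): compare the object store of a full clone against a history-free one.

    git clone https://example.com/repo.git full && du -sh full/.git
    git clone --depth=1 https://example.com/repo.git shallow && du -sh shallow/.git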
"Will they remember to write the filter? Maybe, "
Nothing wrong with "forgetting" to write the filter; if the clone is taking more than 10 minutes, abort it and re-run with the filter.
What? Why would you want to expose a beginner to waiting 10 minutes unnecessarily? How would they even know what they did wrong, or what a reasonable time to wait is? Ask ChatGPT "why is my git clone taking 10 minutes"?!
Is this really the best we can do in terms of user experience? No. Git needs to step up.
Git is not for beginners in general, and large repos even less so.
A beginner will follow the instructions in a README: "run git clone" or "run git clone --depth=1".
It is a solution. The fact that beginners might not understand it doesn't really matter; solutions need not perish on that alone. Clone is a command people usually run once, while setting up a repository. Maybe the case could be made that this behavior should be the default and that full clones should be opt-in, but that's a separate issue.