I made my own Git

12 days ago (tonystr.net)

Nice work! On a complete tangent, Git is the only SCM known to me that supports recursive merge strategy [1] (instead of the regular 3-way merge), which essentially always remembers resolved conflicts without you needing to do anything. This is a very underrated feature of Git and somehow people still manage to choose rebase over it. If you ever get to implementing merges, please make sure you have a mechanism for remembering the conflict resolution history :).

[1] https://stackoverflow.com/questions/55998614/merge-made-by-r...

  • I remember in a previous job having to enable git rerere, otherwise it wouldn't remember previously resolved conflicts.

    https://git-scm.com/book/en/v2/Git-Tools-Rerere

    • I believe rerere is a local cache, so you'd still have to resolve the conflicts again on another machine. The recursive merge doesn't have this issue — the conflict resolution inside the merge commits is effectively remembered (although due to how Git operates it actually never even considers it a conflict to be remembered — just a snapshot of the closest state to the merged branches)

      3 replies →

    • The recursive merge is about merging branches that already have merges in them, while rerere is about repeating the same merge several times.

    • Rerere is dangerous and counterproductive - it tries to give rebase the same functionality that merge has, but since rebase is fundamentally wrong it only stacks the wrongness.

      1 reply →

  • Much more principled (and hence less of a foot-gun) way of handling conflicts is making them first class objects in the repository, like https://pijul.org does.

    • Jujutsu too[0]:

      > Jujutsu keeps track of conflicts as first-class objects in its model; they are first-class in the same way commits are, while alternatives like Git simply think of conflicts as textual diffs. While not as rigorous as systems like Darcs (which is based on a formalized theory of patches, as opposed to snapshots), the effect is that many forms of conflict resolution can be performed and propagated automatically.

      [0] https://github.com/jj-vcs/jj

    • I feel like people making new VCSes should just re-use GIT storage/network layer and innovate on top of that. Git storage is flexible enough for that, and that way you can just.... use it on existing repos with very easy migration path for both workflows (CI/CD never need to care about what frontend you use) and users

      3 replies →

  • New to me was discovering within the last month that git-merge doesn't have a merge strategy of "null": don't try to resolve any merge conflicts, because I've already taken care of them; just know that this is a merge between the current branch and the one specified on the command-line, so be a dutiful little tool and just add it to your records. Don't try to "help". Don't fuck with the index or the worktree. Just record that this is happening. That's it. Nothing else.

    • Doesn't `git merge -s ours` do this?

          This resolves any number of heads, but the resulting tree of the merge is always
          that of the current branch head, effectively ignoring all changes from all other
          branches. It is meant to be used to supersede old development history of side
          branches. Note that this is different from the -Xours option to the ort merge strategy.

  • I hate git squash, it only goes one direction and personally I dont give a crap if it took you 100 commits to do one thing, at least now we can see what you may have tried so we dont repeat your mistakes. With git squash it all turns into, this is what they last did that mattered, and btw we cant merge it backwards without it being weird, you have to check out an entirely new branch. I like to continue adding changes to branches I have already merged. Not every PR is the full solution, but a piece of the puzzle. No one can tell me that they only need 1 PR per task because they never have a bug, ever.

    Give me normal boring git merges over git squash merges.

  • That's something new to me (using git for 10 years, always rebased)

    • I'm even more lazy. I almost always clone from scratch after merging or after not touching the project for some time. So easy and silly :)

      I always forget all the flags and I work with literally just: clone, branch, checkout, push.

      (Each feature is a fresh branch tho)

  • as far as I understand the problem (sorry, the SO isn't the clearest around), Fossil should support this operation. It does one better, since it even tracks exactly where merges come from. In Git, you have a merge commit that shows up with more than one parent, but Fossil will show you where it branched off too.

    Take out the last "/timeline" component of the URL to clone via Fossil: https://chiselapp.com/user/chungy/repository/test/timeline

    See also, the upstream documentation on branches and merging: https://fossil-scm.org/home/doc/trunk/www/branching.wiki

Great writeup! It's always fun to learn the details of the tools we use daily.

For others, I highly recommend Git from the Bottom Up[1]. It is a very well-written piece on internal data structures and does a great job of demystifying the opaque git commands that most beginners blindly follow. Best thing you'll learn in 20ish minutes.

1. https://jwiegley.github.io/git-from-the-bottom-up/

If you ever wonder how coding agents know how to plan things etc, this is the kind of article they get this training from.

Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.

  • Interestingly, I looked at github insights and found that this repo had 49 clones, and 28 unique cloners, before I published this article. I definitely did not clone it 49 times, and certainly not with 28 unique users. It's unlikely that the handful of friends who follow me on github all cloned the repo. So I can only speculate that there are bots scraping new public github repos and training on everything.

    Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

    The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by chatgpt along with comparisons to similar algorithms).

    • Particularly on GitHub, might not even be LLMs, just regular bots looking for committed secrets (AWS keypairs, passwords, etc.)

    • I don't really get why they need to clone in order to scrape ...?

      > It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

      That's very much expected. That's why the quality of LLM coding agents is like it is. (No offense.)

      The "asking LLMs for advice" part is where the circular aspect starts to come into the picture. Not worse than looking at StackOverflow though which then links to other people who in turn turned to StackOverflow for advice.

      3 replies →

  • > Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.

    Great argument for not using AI-assisted tools to write blog posts (especially if you DO use these tools). I wonder how much we're taking for granted in these early phases before it starts to eat itself.

  • I understand model output put back into training would be an issue, but if model output is guided by multiple prompts and edited by the author to his/her liking wouldn't that at least be marginally useful?

  • Random aside about training data:

    One of the funniest things I've started to notice from Gemini in particular is that in random situations, it talks with english with an agreeable affect that I can only describe as.. Indian? I've never noticed such a thing leak through before. There must be a ton of people in India who are generating new datasets for training.

    • There was a really great article or blog post published in the last few months about the author's very personal experience whose gist was "People complain that I sound/write like an LLM, but it's actually the inverse because I grew up in X where people are taught formal English to sound educated/western, and those areas are now heavily used for LLM training."

      I wish I could find it again, if someone else knows the link please post it!

      7 replies →

Nice post :). It made me think of ugit: DIY Git in Python [1] which is still by far my favorite of this kind of posts. It really goes deep into Git internals while managing to stay easy to follow along the way.

[1] https://www.leshenko.net/p/ugit/

Me too. Version control is great, it should get more use outside of software.

https://github.com/gotvc/got

Notable differences: E2E encryption, parallel imports (Got will light up all your cores), and a data structure that supports large files and directories.

  • The problem is when you move beyond text files it gets hard to tell what changes between two versions without opening both versions in whatever program they come from and comparing.

    • > The problem is when you move beyond text files it gets hard to tell what changes between two versions without opening both versions in whatever program they come from and comparing.

      Yeah, totally agree. Got has not solved conflict resolution for arbitrary files. However, we can tell the user where the files differ, and that the file has changed.

      There is still value in being able to import files and directories of arbitrary sizes, and having the data encrypted. This is the necessary infrastructure to be able to do distributed version control on large amounts of private data. You can't do that easily with Git. It's very clunky even with remote helpers and LFS.

      I talk about that in the Why Got? section of the docs.

      https://github.com/gotvc/got/blob/master/doc/1.1_Why_Got.md

Zstd dictionary compression is essentially how Meta's Mercurial fork (Sapling VCS) stores blobs https://sapling-scm.com/docs/dev/internals/zstdelta. The source code is available in GitHub if folks want to study the tradeoffs vs git delta-compressed packfiles.

I think theoratically, Git delta-compression is still a lot more optimized for smaller repos. But for bigger repos where sharding storaged is required, path-based delta dictionary compression does much better. Git recently (in the last 1 year) got something called "path-walk" which is fairly similar though.

> If I were to do this again, I would probably use a well-defined language like yaml or json to store object information.

I know this is only meant to be an educational project, but please avoid yaml (especially for anything generated). It may be a superset of json, but that should strongly suggest that json is enough.

I am aware I'm making a decade old complaint now, but we already have such an absurd mess with every tool that decided to prefer yaml (docker/k8s, swagger, etc.) and it never got any better. Let's not make that mistake again.

People just learned to cope or avoid yaml where they can, and luckily these are such widely used tools that we have plenty of boilerplate examples to cheat from. A new tool lacking docs or examples that only accepts yaml would be anywhere from mildly frustrating to borderline unusable.

Reminds me of when I tried to invent a SPA framework. So much hidden complexity I hadn’t thought of and I found myself going down rabbit holes that I am sure the creators of React and Angular went down. Git seems to be like this and I am often reminded of how impressive it is at hiding underlying complexity.

  • > at hiding underlying complexity.

    It's only in the context of recreating Git that this comment makes sense.

> If you want to look at the code, it's available on github.

Why not tvc-hub :P

Jokes aside, great write up!

  • haha, maybe that's the next project. It did feel weird to make git commits at the same time as I was making tvc commits

It’s really a shame git storage use files as the unit for storage. That’s what makes it improper for usage with many of small files, or large files.

Content-based chunking like Xethub uses really should become the default. It’s not like it’s new either, rsync is based on it.

https://huggingface.co/blog/xethub-joins-hf

Learning git internals was definitely the moment it became clear to me how efficient and smart git is.

And this way of versionning can be reused in other fields, as soon as have some kind of graph of data that can be modified independently but read all together then it makes sense.

>The hardest part about this project was actually just parsing.

How about using sqlite for this? Then you wouldn't need to parse anything, just read/update tables. Fast indexing out of the box, too.

> These objects are also compressed to save space, so writing to and reading from .git/objects/ will always involve running a compression algoritm. Git uses zlib to compress objects, but looking at competitors, zstd seemed more promising:

That's a weird thing to put so close to the start. Compression is about the least interesting aspect of Git's design.

  • When you are learning, everything is important. I think it is okay to cut the person some slack regarding this.

    • Yes, probably.

      It's just that git does a much more interesting job with compression, actually. Lot's more to learn. They don't compress the snapshots via something like zstd directly, that comes much later after a delta step. (Interestingly, that delta compression step doesn't use the diffs that `git show` shows you for your commits.)

Does this git include empty folder? I always annoy that it's not track empty folder.

  • Actually, the Git data model supports empty directories, however, the index doesn't since it only maps names to files but not to directories. You can even create a commit with a root directory using --allow-empty, and it will use the hardcoded empty tree object (4b825dc642cb6eb9a060e54bf8d69288fbee4904).

  • yep! Had to check to be sure:

        Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
         Running `target/debug/tvc decompress f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5`

    === args === decompress f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5 === tvcignore === ./target ./.git ./.tvc

    === subcommand === decompress ------------------ tree ./src/empty-folder e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 blob ./src/main.rs fdc4ccaa3a6dcc0d5451f8e5ca8aeac0f5a6566fe32e76125d627af4edf2db97

    • huh, cool. what happens if you use vanilla-git to clone a repo that contains empty folders? and do forges like github display them properly?

"Though I suck at it, my go-to language for side-projects is always Rust"

Hmm, dont be so hard on yourself!

proceeds to call ls from rust

Ok nevermind, although I dont think rust is the issue here.

(Tony I'm joking, thanks for the article)

Ftr you can make repos with sha256 now.

I wonder if signing sha-1 mitigates the threat of using an outdated hash.

I do wonder if the compression step makes sense at this layer instead of the filesystem layer.

  • Interesting take. I'm using btrfs (instead of ext4) with compression enabled (using zstd), so most of the files are compressed "transparently" - the files appear as normal files to the applications, but on disk it is compressed, and the application don't need to do the compress/decompress.

sha256 is a very slow algorithm, even with hardware acceleration. BLAKE3 would probably make a noticeable performance difference.

Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/

It is really hard to describe how slow sha256 is. Go sha256 some big files. Do you think it's disk IO that's making it take so long? It's not, you have a super fast SSD. It's sha256 that's slow.

  • It depends on the architecture. On ARM64, SHA-256 tends to be faster than BLAKE3. The reasons being that most modern ARM64 CPUs have native SHA-256 instructions, and lack an equivalent of AVX-512.

    Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate the existence of the large inputs altogether.

    For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content Defined Chunking (CDC) algorithm, and model large files as a manifest of chunks. That way you get better deduplication. The resulting chunks can then be hashed in parallel, using a single-threaded algorithm.

    • As far as I know, most CDC schemes requires a single-threaded pass over the whole file to find the chunk boundaries? (You can try to "jump to the middle", but usually there's an upper bound on chunk length, so you might need to backtrack depending on what you learn later about the last chunk you skipped?) The more cores you have, the more of a bottleneck that becomes.

      1 reply →

  • Is that even when using the SHA256 hardware extensions? https://en.wikipedia.org/wiki/SHA_instruction_set

    • It's mixed. You get something in the neighborhood of a 3-4x speedup with SHA-NI, but the algorithm is fundamentally serial. Fully parallel algorithms like BLAKE3 and K12, which can use wide vector extensions like AVX-512, can be substantially faster (10x+) even on one core. And multithreading compounds with that, if you have enough input to keep a lot of cores occupied. On the other hand, if you're limited to one thread and older/smaller vector extensions (SSE, NEON), hardware-accelerated SHA-256 can win. It can also win in the short input regime where parallelism isn't possible (< 4 KiB for BLAKE3).

I wonder if in the near future there will be no tools anymore in the sense we know it. you will maybe describe the tool you need and its created on the fly.

nice work! This is one of the best ways to deeply learn something, reinvent the wheel yourself.

Nice work, it's always interesting to see how one would design their own VCS from scratch, and see if they fall into problems existing implementations fell into in the past and if the same solution was naturally reached.

The `tvc ls` command seems to always recompute the hash for every non-ignored file in the directory and its children. Based on the description in the blog post, it seems the same/similar thing is happening during commits as well. I imagine such an operation would become expensive in a giant monorepo with many many files, and perhaps a few large binary files thrown in.

I'm not sure how git handles it (if it even does, but I'm sure it must). Perhaps it caches the hash somewhere in the `.git`directory, and only updates it if it senses the file hash changed (Hm... If it can't detect this by re-hashing the file and comparing it with a known value, perhaps by the timestamp the file was last edited?).

> Git uses SHA-1, which is an old and cryptographically broken algorithm. This doesn't actually matter to me though, since I'll only be using hashes to identify files by their content; not to protect any secrets

This _should_ matter to you in any case, even if it is "just to identify files". If hash collisions (See: SHAttered, dating back to 2017) were to occur, an attacker could, for example, have two scripts uploaded in a repository, one a clean benign script, and another malicious script with the same hash, perhaps hidden away in some deeply nested directory, and a user pulling the script might see the benign script but actually pull in the malicious script. In practice, I don't think this attack has ever happened in git, even with SHA-1. Interestingly, it seems that git itself is considering switching to SHA-256 as of a few months ago https://lwn.net/Articles/1042172/

I've not personally heard of the process of hashing to also be known as digesting, though I don't doubt that it is the case. I've mostly familiar of the resulting hash being referred to as the message digest. Perhaps it's to differentiate between the verb 'hash' (the process of hashing) with the output 'hash' (the ` result of hashing). And naming the function `sha256::try_digest`makes it more explicit that it is returning the hash/digest. But it is a bit of a reach, perhaps that are just synonyms to be used interchangeably as you said.

On a tangent, why were TOML files not considered at the end? I've no skin in the game and don't really mind either way, but I'm just curious since I often see Rust developers gravitate to that over YAML or JSON, presumably because it is what Cargo uses for its manifest.

--

Also, obligatory mention of jujutsu/jj since it seems to always be mentioned when talking of a VCS in HN.

  • You are completely right about tvc ls recomputing each hash, but I think it has to do this? A timestamp wouldn't be reliable, so the only reliable way to verify a file's contents would be to generate a hash.

    In my lazy implemenation, I don't even check if the hashes match, the program reads, compresses and tries to write the unchanged files. This is an obvious area to improve performance on. I've noticed that git speeds up object lookups by generating two-letter directories from the first two letters in hashes, so objects aren't actually stored as `.git/objects/asdf12ha89k9fhs98...`, but as `.git/objects/as/df12ha89k9fhs98...`.

    >why were TOML files not considered at the end I'm just not that familiar with toml. Maybe that would be a better choice! I saw another commenter who complained about yaml. Though I would argue that the choice doesn't really matter to the user, since you would never actually write a commit object or a tree object by hand. These files are generated by git (or tvc), and only ever read by git/tvc. When you run `git cat-file <hash>`, you'll have to add the `-p` flag (--pretty) to render it in a human-readable format, and at that point it's just a matter of taste whether it's shown in yaml/toml/json/xml/special format.

    • > A timestamp wouldn't be reliable

      I agree, but I'm still iffy on reading all files (already an expensive operation) in the repository, then hashing every one of them, every time you do an ls or a commit. I took a quick look and git seems to check whether it needs to recalculate the hash based on a combination of the modification timestamp and if the filesize has changed, which is not foolproof either since the timestamp can be modified, and the filesize can remain the same and just have different contents.

      I'm not too sure how to solve this myself. Apparently this is a known thing in git and is called the "racy git" problem https://git-scm.com/docs/racy-git/ But to be honest, perhaps I'm biased from working in a large repository, but I'd rather the tradeoff of not rehashing often, rather than suffer the rare case of a file being changed without modifying its timestamp, whilst remaining the same size. (I suppose this might have security implications if an attacker were to place such a file into my local repository, but at that point, having them have access to my filesystem is a far larger problem...)

      > I'm just not that familiar with toml... Though I would argue that the choice doesn't really matter to the user, since you would never actually write...

      Again, I agree. At best, _maybe_ it would be slightly nicer for a developer or a power user debugging an issue, if they prefer the toml syntax, but ultimately, it does not matter much what format it is in. I mainly asked out of curiosity since your first thoughts were to use yaml or json, when I see (completely empirically) most Rust devs prefer toml, probably because of familiarity with Cargo.toml. Which, by the way, I see you use too in your repository (As to be expected with most Rust projects), so I suppose you must be at least a little bit familiar with it, at least from a user perspective. But I suppose you likely have even more experience with yaml and json, which is why it came to mind first.

      2 replies →

Why introduce yet another ignore file? Can you have it read .gitignore if .tvcignore is missing?