Comment by uasi

18 days ago

Git can display diffs between binary files using custom diff drivers:

> Put the following line in your .gitattributes file: *.docx diff=word

> This tells Git that any file that matches this pattern (.docx) should use the “word” filter when you try to view a diff that contains changes. What is the “word” filter? You have to set it up [in .gitconfig].

https://git-scm.com/book/en/v2/Customizing-Git-Git-Attribute...
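For anyone who wants to try this without installing a converter, here is a minimal sketch of a textconv helper for that "word" driver. Git runs the textconv command with the file path as its single argument and diffs whatever the command prints. The script name `docx_textconv.py` and the function `docx_to_text` are made up for illustration, and it only pulls paragraph text out of `word/document.xml`, nothing fancier.

```python
#!/usr/bin/env python3
"""Hypothetical textconv helper: print the plain text of a .docx file."""
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return one line of text per paragraph (<w:p>) in the document body."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter(W + "p"):
        # Text runs live in <w:t> elements; join them per paragraph.
        lines.append("".join(t.text or "" for t in paragraph.iter(W + "t")))
    return "\n".join(lines)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Git passes the path of a temporary copy of the blob as argv[1].
    print(docx_to_text(sys.argv[1]))
```

With the `*.docx diff=word` attribute in place, something like `git config diff.word.textconv "python3 /path/to/docx_textconv.py"` should wire it up (the path is a placeholder).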

In their 'Git is unsuited for applications' blog post[0] they also say the following:

> We currently have to clone the whole repository just to edit translation files. That is problematic for big repositories. The repository for posthog.com for example is ~680MB in size. Even though we only need translation files which would be at max 1MB in size, we have to clone the whole repository. That is also one of the reasons why git is not used at Facebook, Google & Co which have repository sizes in the gigabytes.

I get that it can be a bit complex, but Git can handle this circumstance pretty easily if you know how (or write a script for it).

For example, cloning the GIMP repo from GitLab takes me about 56 seconds and uses up 632 MB on disk, using just `git clone <repo>`.

In comparison, running these commands:

    git clone --quiet --filter=blob:none --sparse https://gitlab.gnome.org/GNOME/gimp.git gimp-sparse-clone
    git -C gimp-sparse-clone sparse-checkout add po po-libgimp po-plug-ins po-python po-script-fu po-tags po-tips po-windows-installer

(You can also run `git sparse-checkout init --no-cone` and then just `git sparse-checkout add '*.po'` to grab every .po file in the repo and nothing else. Quote the pattern so your shell doesn't expand it before Git sees it.)

Takes 14 seconds on my laptop and uses 59 MB of disk space, and checks out only the specified directories and their contents.

So yeah, it's not as automatic as one might like, but ship a shell script to your translators and you're good to go. The 'Git can't do X' arguments are mostly untrue; it should really be 'Getting git to do X is more complicated than I would prefer' or 'Explaining how to do X in git is a pain', both of which are legitimate complaints.
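As a rough sketch of what such a script could look like (written in Python rather than shell so it behaves the same on Windows), this just wraps the two git commands above. The names `sparse_clone_commands` and `sparse_clone` and the command-line layout are my own invention, not anything Git ships.

```python
#!/usr/bin/env python3
"""Hypothetical helper: partial + sparse clone of only the listed directories."""
import subprocess
import sys


def sparse_clone_commands(url, dest, directories):
    """Build the two git invocations shown above as argv lists."""
    clone = ["git", "clone", "--quiet", "--filter=blob:none", "--sparse", url, dest]
    checkout = ["git", "-C", dest, "sparse-checkout", "add", *directories]
    return [clone, checkout]


def sparse_clone(url, dest, directories):
    """Run the commands in order; check=True stops at the first failure."""
    for argv in sparse_clone_commands(url, dest, directories):
        subprocess.run(argv, check=True)


if __name__ == "__main__" and len(sys.argv) >= 4:
    # usage: sparse_clone.py <repo-url> <dest-dir> <dir> [<dir> ...]
    sparse_clone(sys.argv[1], sys.argv[2], sys.argv[3:])
```

A translator would then run something like `python3 sparse_clone.py <repo> gimp-sparse-clone po po-tips` and never touch the git flags directly.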

[0] https://samuelstroschein.com/blog/git-limitations/

Would be interesting to see some tooling built around being a custom diff driver for a bunch of different standard formats!

  • I had some interesting luck with the generic approach: unzip the DOCX/XLSX/ODT/etc., then recursively apply other filters, like XML and JSON formatters/prettifiers, to the contents.

    (My work [1] in this space predated git so it wasn't written as a git diff filter, instead it automated source control. But the same principles could be used in the other direction.)

    Not the highest level diffs you could possibly get, but at least for a programmer even ugly XML and JSON diffs were still nice to have over binary diffs.

    [1] https://github.com/WorldMaker/musdex
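The unzip-then-prettify idea can be sketched in a few lines. To be clear, this is not how musdex works internally; it is just an illustration of the principle, and `prettify` and `explode` are hypothetical names. It assumes members ending in `.xml` or `.rels` are well-formed XML and treats everything it doesn't recognize as opaque binary.

```python
import io
import json
import zipfile
import xml.dom.minidom


def prettify(name, data):
    """Return a diff-friendly text form of one archive member, or None if unknown."""
    if name.endswith((".xml", ".rels")):
        return xml.dom.minidom.parseString(data).toprettyxml(indent="  ")
    if name.endswith(".json"):
        return json.dumps(json.loads(data), indent=2, sort_keys=True) + "\n"
    return None


def explode(data, prefix=""):
    """Yield (member path, text) pairs for a zip given as bytes, recursing into nested zips."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in sorted(archive.namelist()):
            if name.endswith("/"):
                continue  # directory entry, nothing to show
            member = archive.read(name)
            if zipfile.is_zipfile(io.BytesIO(member)):
                # Nested container (e.g. an embedded document): recurse into it.
                yield from explode(member, prefix + name + "!/")
                continue
            text = prettify(name, member)
            if text is None:
                text = f"<binary, {len(member)} bytes>\n"
            yield prefix + name, text
```

Printing each path followed by its text gives you something a textconv driver could hand to Git, at the cost of fairly noisy XML diffs.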

This is great for showing diffs. To actually make git store only deltas, not entire binaries, you would need to configure "clean" and "smudge" filters for the format. Given that docx (and xlsx) are a bunch of XML files compressed by zip, you can actually have clean diffs, and small commits.
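One way to sketch such a filter pair, assuming the goal is simply to hand Git an uncompressed archive so its own compression and delta packing have something to work with: the clean side rewrites the zip with stored (uncompressed) members, the smudge side deflates it again. The script name `rezip.py` and the function `rezip` are hypothetical, and the round trip is not byte-identical to what the office application wrote, only content-identical.

```python
#!/usr/bin/env python3
"""Hypothetical clean/smudge filter: re-pack a zip-based document from stdin to stdout."""
import io
import sys
import zipfile


def rezip(data, compression):
    """Rewrite a zip archive (bytes) so every member uses the given compression."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            # Fresh ZipInfo keeps name, timestamp and permissions, in source order.
            fresh = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            fresh.external_attr = info.external_attr
            fresh.compress_type = compression
            dst.writestr(fresh, src.read(info))
    return out.getvalue()


if __name__ == "__main__" and len(sys.argv) == 2:
    # "clean" runs on the way into the repo, "smudge" on checkout.
    mode = zipfile.ZIP_STORED if sys.argv[1] == "clean" else zipfile.ZIP_DEFLATED
    sys.stdout.buffer.write(rezip(sys.stdin.buffer.read(), mode))
```

It would be hooked up with `*.docx filter=rezip` in .gitattributes plus `git config filter.rezip.clean "python3 /path/to/rezip.py clean"` and the matching `filter.rezip.smudge` entry (paths are placeholders). Whether the repository actually ends up smaller depends on how much the XML inside changes between saves.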

Yeah, this is how I would prefer to solve this problem personally, but it would be really nice to have some collection of tools that cover common binary file formats automatically instead of having to configure this manually every time.

This is really great. I read the Git config article, but I thought the image diff example was kinda lackluster. I'm sure some better metrics could be extracted for a more descriptive diff.

Thanks for sharing!