To me, it makes jujutsu look like the Nix of VCSes.
Not meaning to offend anyone: Nix is cool, but adds complexity. And as a disclaimer: I used jujutsu for a few months and went back to git. Mostly because git is wired in my fingers, and git is everywhere. Those examples of what jujutsu can do and not git sound nice, but in those few months I never remotely had a need for them, so it felt overkill for me.
Tbf you wouldn't use/switch to jj for (because of) those kind of commands, and are quite the outlier in the grand list of reasons to use jj. However the option to use the revset language in that manner is a high-ranking reason to use jj in my opinion.
The most frequent "complex" command I use is to find commits in my name that are unsigned, and then sign them (this is owing to my workflow with agents that commit on my behalf but I'm not going to give agents my private key!)
jj log -r 'mine() & ~signed()'
# or if yolo mode...
jj sign -r 'mine() & ~signed()'
I hadn't even spared a moment to consider the git equivalent but I would humbly expect it to be quite obtuse.
No, jj is super simple in daily use, in contrast with git that is a constant chore (and any sane person use alias). This include stuff that in git is a total mess of complexity like dealing with rebases. So not judge the tool for this odd case.
I don’t understand how people can remember all these custom scripting languages. I can’t even remember most git flags, I’m ecstatic when I remember how to iterate over arrays in “jq”, I can’t fathom how people remember these types of syntaxes.
I am convinced that the vast majority of professionals simply don't bother to remember and, ESPECIALLY WITH GIT, just look stuff up every single time the workflow deviates from their daily usage.
At this point perhaps a million person-years have been sacrificed to the semantically incoherent shit UX of git. I have loathed git from the beginning but there's effectively no other choice.
That said, the OP's commands are useful, I am copying them (because obviously I won't ever memorize them).
I don't, I will google things and fiddle, then put it in a git alias (with a comment on what it does and / or where I got it from) and push it to my private dotfiles repo, taking it with me between computers and projects.
jj's template and revset languages are very simple syntactically, so once you're comfortable with the few things you do use often it's just a question of learning about the other existing functions (even if only enough to know to look them up), which slot right in and compose well with everything else you know (unlike flags which typically have each their own system).
Or, perhaps better yet, defining your own functions/helpers as you go for things you might care about, which, by virtue of having been named you, are much easier to remember (and still compose nicely).
You research it once, use it and then remember that it has "ancestor" in the command somewhere and then use ctrl + R to dig up something from your shell history.
Some things are idioms that one repeats so often they just stick, e.g. I use "grep.... | cut -c x-y | sort | uniq -c | sort -nr" to quickly grep frequency of some events from a log file.
Don't feel bad - no one remembers them all, we just remember a few idioms we use...
If I look something up twice, I record it in Obsidian. If I need it more than a couple of times, I'll probably make an alias, a script or a mask [1] file. Autocomplete and autosuggest are essential to my workflow. And good history search.
If you don't have to codedive new projects all the time, there's zero reason to memorize these. If your job is to look at new codebases all the time, you probably learn to remember these commands pretty quickly.
a project isn’t dying because of no commits. Rather it’s stable
I often feel I need to setup bots to make superfluous commits just to make it look like my useful and stable repos are “active”
One example (not mine) a a qr-code generator library. Hasn’t been updated in 10 years. It’s perfect as is. It just provides the size and the bits. You convert those bits to any representation you want. It has no need to be updated
In a real company? A private codebase at a minimum should still be getting regular security patching and dependency updates. Always eventually one of those updates requires some level of refactor. If I see a project with no commits, I run away.
It's rare, I think, for a project to have such a well defined and singular purpose that has not changed in 10 years nor have any bugs been discovered or its dependencies changed underneath it.
It's not impossible, of course, but if I saw even a qr library that hadn't changed in 10 years I would worry that it wouldn't build on current systems (due to dependencies) and that nobody was actually using it (due to lag of bug reports).
This might be true for libraries or utilities that have a well-defined scope and no dependencies, but that's not what the article is focused on. When considering a company's main product, it's usually never done and patterns of activity—and especially changes in those patterns—can give you insight into potential issues.
> a project isn’t dying because of no commits. Rather it’s stable
Agreed. Assuming there are no open issues and PRs. When I find a project, if the date of the last commit is old, I next look at the issues and PRs. If there are simple-to-deal-with issues (e.g. a short question or spam) and easy-to-merge PRs (e.g. fixing a typo in the README) which have been left lingering for years, it’s probably abandoned. Looking at the maintainer’s GitHub activity graph should provide more clues.
> I often feel I need to setup bots to make superfluous commits just to make it look like my useful and stable repos are “active”
I have never done it, but a few times thought about making a “maintenance” release to bump the version number and release date, especially since I often use a variant of calendar versioning.
I don't want to program git, I want to get stuff done so I would reject using that tool and do what the article author did running tried and true pipeable Linux/UNIX commands. It's also the same reason why I dislike Gradle and use Maven, I don't want to program my build I want to define and run my build.
But the git commands in the article is also programming of the same kind, just using more terse, more obscure language. All the shell pipelines are sort, uniq, and grep.
A language that properly maps to the data model, and has readable identifiers is a boon. Git is a database, a database needs a proper query language.
Hah someone really looked at jq (?) and thought: "yes, more of this everywhere". I feel jq is like marmite (edit: aka vegemite, i.e. "you either love it or you hate it")
It's really not that bad, although the jq comparison might be apt. You have such primitives you need to understand, and then everything just fits together nicely. I find this much easier to write and understand than git's cryptic format strings.
I love how the author thinks developers write commit messages.
All joking aside, it really is a chronic problem in the corporate world. Most codebases I encounter just have "changed stuff" or "hope this works now".
It's a small minority of developers (myself included) who consider the git commit log to be important enough to spend time writing something meaningful.
AI generated commit messages helps this a lot, if developers would actually use it (I hope they will).
And in a squash and merge workflow, which are most teams I've been on the past 8 years, it really is the title of the pull request or merge request. That is what really matters.
And I really like that because it leaves room to let the developer do whatever kind of commit messages they want to that makes sense to them. Because nobody's really ever going to read those again after it squashed and merged.
This is a team lead/CTO problem. A good leader will be explicit in their expectations that developers write good commit messages. I've certainly had good leaders that expect this.
Yes, and a culture problem, too. I guess I've been blessed that I've mostly only worked for "grown up" companies, but I've never encountered a workplace where people didn't write useful commit messages. At least one line description of the work done, but often multiple lines of valuable context. Only the junior devs had to be told to do it, but once they got into the habit, everyone understood why we do it and it was no big deal.
If I joined a company where people committed their code with "stuff" or "made some changes" or "asdfhlfo;ejfo;ae," that would be a red flag that I might have joined the wrong company, and I'd start to wonder what else the developers here do carelessly.
I once tutored an intern. Who thought he was The Best Programmer On Earth (didn't we all at that age?).
He refused to use revision control, it slowed him down.
So we told him to commit at least once every day, with a relevant commit message, or else fail his internship.
He worked 21 more days. There were 21 commits: "17:00, time to go home".
In codebases where PRs are squashed on merge, the commit messages on the main branch end up being the PR body description text, and that's actually reviewed so tends to be much better I find.
I've seen it be the concatenated individual git commit messages way too often. Just a full screen scroll of "my hands are writing letters" and "afkifrj". Still better than if we had those commits individually of course, but dear god.
The gold standard is rebased linear unsquashed history with literary commits, but I'll take merged and squashed PR commits with sensible commit messages without complaint.
And in every codebase I've been in charge of, each PR has one or more issue # linked which describe every possible antagonizing detail behind that work.
I understand this isn't inline with traditional git scm, but it's a very powerful workflow if you are OK with some hybridization.
Only two of the five depend on commit messages. Churn, authorship, and velocity work regardless. Even teams with terrible hygiene write "fix" when something breaks.
Our small team has a lot of commit messages like this. For a while, we had a guy on the team who had come from a site that expected more. The pet peeve he brought along was that commit messages end with a period (my guess is that someone at their previous work place had reasoned that forcing periods encouraged developers to actually write meaningful sentences). When I look at that period of development, I see lots of messages like “stuff changed.” And “more stuff changed.” And then it goes back to just “stuff changed” around the time they moved on.
> my guess is that someone at their previous work place had reasoned that forcing periods encouraged developers to actually write meaningful sentences
I have actually seen proper capitalization and correct conventional-commit types to correlate very well with the author being intentional and the patch being of good quality.
e.g.
- (a) chore: update some_module to include new_func
- (b) feat: Add new_func to handle XYZ case
Where:
(a) is not a chore, as it changes functionality, is uncapitalized and is so low-signal I can probably write a 10 line script to reliably generate similar titles.
(b) is using the correct "feat" commit type, capitalized and describe what this is for. I expect the body to explain "why", as well, and not to reiterate the "how" in natural language.
This is just my experience, but I've seen commit messages where people actually put in some effort to usually come with a good patch, and vice-versa.
Is it really a small minority? I have never worked on a project that didn't have commit messages that at least tried to be descriptive (sometimes people fail at it but its very different to an outright "changed stuff").
I don't remember any friend mentioning to me them encountering a work project where the messages were totally neglected either.
Never seen that in any company I worked at either and I can’t believe professional developers seem to think that it would be ok to write meaningless commit messages. That’s just so sloppy.
tbh I'm not convinced that a git log history should be treated as a group journal because it's not.
relying on git commit messages assumes they're correct by convention since there is no technical constraint to enforce it. and it assumes no work in progress commits, sometimes it's just necessary to hit the save button real quick or move a workspace from one device to another.
my point is:
git is a way of storing and loading files at its core.
Only two of the five insights are based on commit messages and the author acknowledges that they won't work in projects without message discipline. But the remaining ones will give you valuable insights even into the most lazy project department.
It's because the vast majority of commit messages are never read by anyone, and there's other ways to fund out what happened in the handful of cases where you would need to.
I read commit messages all the time to figure out what a change was about.
For small personal projects I often write one phrase messages with `-m`, but if you're working with other people you should be writing good commit messages.
I ran these commands on a number of codebases I work on and I have to say they paint a very different picture than the reality I know to be true.
> git shortlog -sn --no-merges
Is the most egregious. In one codebase there is a developer's name at the top of the list who outpaced the number 2 by almost 3x the number of commits. That developer no longer works at the company? Crisis? Nope, the opposite. The developer was a net-negative to the team in more ways than one, didn't understand the codebase very well at all, and just happened to commit every time they turned around for some reason.
Yeah. I am the top committer at my current workplace, but I'd say that a majority of that gap is because my particular workflow results in many smaller commits than my coworkers.
Everything in context. This is one of many reasons I'm a proponent of squash-and-merge. If a change really needs more than one permanent commit, it should probably be split up or if absolutely necessary should be on a feature branch maintaining similar process. Under this process, feature branches are not squashed.
This leaves developers to commit locally and comment as much or little as they like.
I'm not saying more commits = bad developer. In my example that happened to be the case but not because they had a lot of commits but because they were bad at their job. I was just trying to warn that taking these git snippets at face-value does not paint the full picture.
If someone came to me and said "I ran these and I see XXX was the most prolific committer and they left X months ago, what will be do???" I'd have to work hard not to laugh.
Since these snippets are self-described as ways to get familiar with the code/projects I wanted to provide the counter point. Most of those snippets do not at all paint the real picture and for all the repos I tested it on they paint the opposite of reality.
I know these codebases like the back of my hand, the purported purpose of these snippets is to better understand the codebase, I can tell you they don't work for anything I tested them on. Maybe they work for other codebases but the sample size I have access to says they don't work for me.
> The 20 most-changed files in the last year. The file at the top is almost always the one people warn me about. “Oh yeah, that file. Everyone’s afraid to touch it.”
The most changed file is the one people are afraid of touching?
I've just tried this, and the most touched files are also the most irrelevant or boring files (auto generated, entry-point of the service etc.) in my tests.
I just tried it too and it basically just flagged a handful of 1500+ line files which probably ought to be broken up eventually but arent causing any serious problems.
Plotting Churn against Complexity is far more useful than merely churn.
It shows places that are problematic much better. High churn, low complexity: fine. Its recognized and optimizef that this is worked on a lot (e.g. some mapping file, a dsl, business rules etc). Low churn high complexity: fine too. Its a mess, but no-one has to be there.
But both? Thats probably where most bugs originate, where PRs block, where test coverage is poor and where everyone knows time is needed to refactor.
In fact, quite often I found that a teams' call "to rewrite the app from scratch" was really about those few high-churn-high-complexity modules, files or classes.
Complexity is a deep topic, but even simple checks like how nested smt is, or how many statements can do.
This command needs a warning. Using this command and drawing too many conclusions from it, especially if you’re new, will make you look stupid in front of your team mates.
I ran this on the repo I have open and after I filtered out the non code files it really can only tell me which features we worked on in the last year. It says more about how we decided to split up the features into increments than anything to do with bugs and “churn”.
I found it interesting, that Git itself has built in similarity notion... when it packs objects, it groups files by path+size, runs delta cmpression to find which are close.
Better for people to know they're just blindly copying tools and parroting their output as if it's automatically meaningfully. Any warning against that should be built into the individual, for their own sake
Yes. Because the fear is butressed with necessity. You have to edit the file, and so does everyone else and that is a recipe for a lot of mess. I can think back over years of files like this. Usually kilolines of impossible to reason about doeverything.
Definitely not in my experience. The most changed are the change logs, files with version numbers and readmes. I don't think anyone is afraid of keeping those up to date.
pom.xml and package.json came up on couple of separate projects I ran the commands on. Which makes sense because the versions get bumped rather frequently. I guess context matters, as usual.
Curious - why write it as a function in presumably .gitconfig and not just a git-summary script in your path? Just seems like a lot of extra escapes and quotes and stuff
Note this doesn't work by default on macOS, where git uses the POSIX version of grep (which doesn't support `\b`). You need to use the Perl Regexp option by switching -E with -P:
> One caveat: squash-merge workflows compress authorship. If the team squashes every PR into a single commit, this output reflects who merged, not who wrote. Worth asking about the merge strategy before drawing conclusions.
In my experience, when the team doesn't squash, this will reflect the messiest members of the team.
The top committer on the repository I maintain has 8x more commits than the second one. They were fired before I joined and nobody even remembers what they did. Git itself says: not much, just changing the same few files over and over.
Of course if nobody is making a mess in their own commits, this is not an issue. But if they are, squash can be quite more truthful.
I wouldn't trust "commit counts." The quality and content of a "commit" can vary widely between developers. I have one guy on my team who commits only working code that has been thoroughly tested locally, another guy who commits one line changes that often don't work, only to be followed by fixes, and more fixes. His "commits" have about 1/100th of the value of the first guy.
My comment still seems relevant? Do frequent commits to correct mistakes imply more "value" than infrequent, but well tested, commits, or what? I don't think it is a reliable signal.
Yeah, this one demonstrates a particularly pernicious view of software development. One where growth, no matter how artificial, is the only sign of success.
If you work with service oriented software, the projects that are "dying" may very well be the most successful if it's a key component. Even from a business perspective having to write less code can also be a sign of success.
I don't know why this was overlooked when the churn metric is right there.
Whenever we initiated a new (internal) SW project, it had to go through an audit. One of the items in the checklist for any dependency was "Must have releases in the last 2 years"
I think the rationale was the risk of security vulnerabilities not being addressed, but still ...
That was my question too. I have plenty of projects I've worked on where they rarely get touched anymore. They don't need new features and nothing is broken.
Technically you're correct that change frequency doesn't necessarily mean dead, but the number of projects that are receiving very few updates because they're 'done' is a fraction of a fraction of a percent compared to the number that are just plain dead. I'm certain you can use change frequency as a proxy and never be wrong.
Rather than using an LLM to write fluffy paragraphs explaining what each command does and what it tells them, the author should have shown their output (truncated if necessary)
This one was funny to me because sure, it was accurate for my particular codebase, but also anyone paying attention to the company Slack would already know how often fires happen.
Solid list. I'd add git log --all --oneline --graph pretty early on — gives you a quick sense of how active different branches are and whether this is a "one person commits everything" project or actually distributed. Helped me a ton on a job where I inheritied a monolith with like 4 years of history.
The git blame tip is underrated. People treat it like a gotcha tool but its maybe the fastest way to find the PR/ticket that explains a weird decision.
and it touches in detail what exactly commit standards should be, and even how to automate this on CI level.
And then I also have idea/vision how to connect commits to actual product/technical/infra specs, and how to make it all granular and maintainable, and also IDE support.
I would love to see any feedback on my efforts. If you decide to go through my entire 3 posts I wrote, thank you
Most good projects end up solving a problem permanently and if there is no salary to protect with bogus new features it is then to be considered final?
> One caveat: squash-merge workflows compress authorship. If the team squashes every PR into a single commit, this output reflects who merged, not who wrote. Worth asking about the merge strategy before drawing conclusions.
I abhor squash merging for this and a few other reasons. I literally have to go out of my way to re-check out a branch. Someone who wants to use my current branch cannot do so if I merge my changes a month later, because the squash rewrites history, and now git is very confused. I don't get the obsession with "cleaning up the history" as if we're all always constantly running out of storage over 2 more commits.
For me the benefit is that I can revert or cherry-pick things one entire PR at a time, and I don't have to care if the author implemented their PR with a bunch of small "work in progress" commits.
And GitHub at least sets the author of the squashed commit as the one who opened the PR, not the one who merged it.
I can definitely see where it wouldn't work well for other workflows but I've had it work well on several teams and it seems easier than trying to get everyone to clean up their commits into nice, clean, well-titled histories before putting up a PR.
> If the team squashes every PR into a single commit, this output reflects who merged, not who wrote.
Squash-merge workflows are stupid (you lose information without gaining anything in return as it was easily filterable at retrieval anyway) and only useful as a workaround for people not knowing how to use git, but git stores the author and committer names separately, so it doesn't matter who merged, but rather whether the squashed patchset consisted of commits with multiple authors (and even then you could store it with Co-authored-by trailers, but that's harder to use in such oneliners).
Can you explain to me (an avid squash-merger) what extra information do you gain by having commits that say "argh, let's see if this works", "crap, the CI is failing again, small fix to see if it works", "pushing before leaving for vacation" in the main git history?
With a squash merge one PR is one commit, simple, clean and easy to roll back or cherry-pick to another branch.
These commits reaching the reviewer are a sign of either not knowing how to use git or not respecting their time. You clean things up and split into logical chunks when you get ready to push into a shared place.
If someone uses git commits like the save function of their editor and doesn't write messages intended for reading by anyone else, it makes sense to want to hide them
For other cases, you lose the information about why things are this way. It's too verbose to //comment on every like with how it came to be this way but on (non-rare in total, but rare per line) occasion it's useful to see what the change was that made the line be like this, or even just who to potentially ask for help (when >1 person worked on a feature branch, which I'd say is common)
Part of new feature you had working in an intermediate commit, but broke somewhere along the way and is not working in your last commit when you squashed.
If you catch it early enough, I suppose it's in your reflog, but otherwise you're screwed.
It sounds like a silly example, but I bet most developers have run into this at some point.
With mercurial/jujutsu, you get the best of both worlds: The "argh, let's see if this works" commits are what I call "microcommits", and the squashed versions are the real/public commits. With jujutsu, you get both. Your log shows only the "real" commits (equivalent of squashing all the commits between that and the prior "real" commit). But if you want to drill down into the microcommits, the information is always there.
Let's acknowledge the reality. Many people use git not just for version control, but for backup ("Let me commit this so I don't lose it"). Let's ensure the VC tool supports both and doesn't force you to pick one over the other.
You gain the extra information by having reasonable commit messages rather than the ones you mentioned. To fix CI you force push.
Can you explain to me what an avid squash-merger puts into the commit message of the squashed commit composed of commits "argh, let's see if this works", "crap, the CI is failing again, small fix to see if it works", and "pushing before leaving for vacation" ?
> "argh, let's see if this works", "crap, the CI is failing again, small fix to see if it works", "pushing before leaving for vacation"
These are all bad commits IMHO. Aside from the CI one, I understand that message. I have commits like that on personal projects but for professional projects I'd be frustrated if people were committing messages like that.
Personally I'm a "one commit" type of guy, I don't like committing things in a broken state even on a side branch unless I have to (to share the code or test a CI). Occasionally I will make multiple commits at the very end to make review easier or once I have everything working but I want to try something different but I have a bunch of options of saving code that don't involving committing:
- Stash
- Shevle (IDEA)
- Backblaze
- Time Machine
- Local History (IDEA)
The idea of committing WIP before leaving for a vacation just feels so wrong to me.
I once worked for someone who wanted developers to commit code before the end of every day as a safety measure. His reasoning was in case the developer's computer died or similar. I found that silly at the time and still do now. That's what backups are for, I dislike when people use git as a backup like that in a professional setting.
Squash merge is the only reasonable way to use GitHub:
If you update a PR with review feedback, you shouldn’t change existing commits because GitHub’s tools for showing you what has changed since your last review assume you are pushing new commits.
But then you don’t want those multiple commits addressing PR feedback to merge as they’re noise.
So sure, there’s workflows with Git that doesn’t need squashing. But they’re incompatible with GitHub, which is at least where I keep my code today.
Is it perfect? No. But neither is git, and I live in the world I am given.
Yes, I think people who are anti squash merge are those who don't work in Github and use a patch based system or something different. If you're sending a patch for linux, yes it makes sense that you want to send one complete, well described patch. But Github's tooling is based around the squash merge. It works well and I don't know anyone in real life who has issues with it.
And to counter some specific points:
* In a github PR, you write the main commit msg and description once per PR, then you tack on as many commits as you want, and everyone knows they're all just pieces of work towards the main goal of the eventually squashed commit
* Forcing a clean up every time you make a new commit is not only annoying extra work, but it also overwrites history that might be important for the review of that PR (but not important for what ends up in main branch).
* When follow up is requested, you can just tack on new commits, and reviewers can easily see what new code was added since their last review. If you had to force overwrite your whole commit chain for the PR, this becomes very annoying and not useful to reviewers.
* In the end, squash merge means you clean up things once, instead of potentially many times
If your goal here is to have linear history, then just use a merge commit when merging the PR to main and always use `git log --first-parent`. That will only show commits directly on main, and gives you a clean, linear history.
If you want to dig down into the subcommits from a merge, then you still can. This is useful if you are going back and bisecting to find a bug, as those individual commits may hold value.
You can also cherry pick or rollback the single merge commit, as it holds everything under it as a single unit.
This avoids changing history, and importantly, allows stacked PRs to exist cleanly.
The author is talking about the case where you have coherent commits, probably from multiple PRs/merges, that get merged into a main branch as a single commit.
Yeah, I can imagine it being annoying that sqashing in that case wipes the author attribution, when not everybody is doing PRs against the main branch.
However, calling all squash-merge workflows "stupid" without any nuance.. well that's "stupid" :)
I don't think there's much nuance in the "I don't know --first-parent exists" workflow. Yes, you may sometimes squash-merge a contribution coming from someone who can't use git well when you realize that it will just be simpler for everyone to do that than to demand them to clean their stuff up, but that's pretty much the only time you actually have a good reason to do that.
I think the point is that if you have to squash, the PR-maker was already gitting wrong. They should have "squashed" on their end to one or more smaller, logically coherent commits, and then submitted that result.
Squash-merge is entirely fine for small PRs. Cleaning up the commits in advance (probably to just squash them to one or two anyway) is extra work, and anything that discourages people from pushing often (to get the code off their local machine) needs to be well-justified. Just review the (smallish!) total outcome of all the commits and squash after review. A few well-placed messages on the commit, attached to relevant lines, are more helpful and less work than cleaning up the commit history of a smallish PR.
For really large PRs, I’m more inclined to agree with you, but those should probably have their own small-PR-and-squash-merge flow that naturally cleans up their git history, anyway.
I categorically disagree that squash-merge is “stupid” but agree there are many ways to skin this cat.
How does not squash merging deal with the fact that branches disappear when merging? What I mean is that the information "this commit happened in the context of this PR or this overarching goal" goes missing. When you squash, you use the one central unit of information management in Git: the commit.
Having the commit graph easy to filter means exactly that you don't have to sift through hundreds of commits for no reason. What else did you think it would mean?
These were interesting but I don't know if they'd work on most or any of the places I've worked. Most places and teams I've worked at have 2-3 small repos per project. Are most places working with monorepos these days?
For "what changes the most", in my project it's package.json / lock (because of automatic dependency updates) and translation / localization files; I'd argue that's pretty normal and healthy.
For the "bus factor", there's one guy and then there's me, but I stopped being a primary contributor to this project nearly two years ago, lol.
I just finished¹ building an experimental tool that tries to figure out if a repo is slopware or not just by looking at it's git history (plus some GitHub activity data).
The takeaway from my experiment is that you can really tell a lot by how / when / what people commit, but conclusions are very hard to generalize.
For example, I've also stumbled upon the "merge vs squash" issue, where squashes compress and mostly hide big chunks of history, so drawing conclusions from a squashed commit is basically just wild guessing.
(The author of course has also flagged this. But I just wanted to add my voice: yeah, careful to generalize.)
Trusting the messages to contain specific keywords seems optimistic. I don't think I used "emergency" or "hotfix" ever. "Revert" is some times automatically created by some tools (E.g. un-merging a PR).
For the stuff I've worked on, if you want to know about bugfixes and emergency releases, you'd go to Jira where those values are formalized as fields. Someone else in the comments here had a suggestion which just looks for the word "fix" which would definitely capture some bugfix releases, but is more likely to catch fixes that were done during development of a feature.
Out of curiosity, I ran the 5 command on my project's public git tree. The only informative one was #4 ("Is This Project Accelerating or Dying") - it showed cliffs when significant pieces of logic were decoupled and moved to other repos.
These commands are very useful, but adapting them to the codebase makes a huge difference.
For most, I added some filters and slightly changed the regex, and it showed the reality of the codebase (I already knew the reality, I just wanted to see if it matched, and it did).
My team usually uses "Squash and merge" when we finish PRs, so I feel that would skew the results significantly as it hides 99% of the commit messages inside the long description of the single squashed merge commit.
Some readme files include changelogs. But aside from that I think this can still net some useful information. I like to look at the most recently changed files in a repo as well.
Fair point. I skip lockfiles, changelogs, and generated code. The first application file on the list is the one that matters. Should have been explicit about that in the post.
It’s easy enough to filter those out with grep. It still is relatively meaningless. If the team incrementally adds things then it’s just going to show what additions were made. It isn’t churn at all.
This way I can see right away which branches are 'ahead' of the pack, what 'the pack' looks like, and what is up and coming for future reference ... in fact I use the 'gss' alias to find out whats going on, regularly, i.e. "git fetch --all && gss" - doing this regularly, and even historically logging it to a file on login, helps see activity in the repo without too much digging. I just watch the hashes.
superficial. If I have to unfuck the backend 10 times a week in our API adapter, then these commands will show me constantly changing the API adapter, although it's the backend team constantly fixing their own bugs
No searching the codebase/commits for "fuck" and shit"? That will give you an idea what what was put in under stressful circumstances like a late night during a crunch.
I'm so used to magit, it seems kind of primitive to pipe git output around like this.
Anyway, I can glean a lot of this information in a few minutes scrolling through and filtering the log in magit, and it doesn't require memorizing a bunch of command line arguments.
> The 20 most-changed files in the last year. The file at the top is almost always the one people warn me about. “Oh yeah, that file. Everyone’s afraid to touch it.”
I've got my Emacs set up to display next to every file that is versioned the number of commits that file has been modified in (for the curious: using a modified all-the-icons-ivy-rich + custom elisp code + custom Bash scripts I wrote and it's trickier than it seems to do in a way that doesn't slows everything down). For example in the menu to open a file or open a recently visited file etc.: basically in every file list, in addition to its size, owner, permissions, etc. I also add the number of commits if it's a versioned file.
I like the fix/bug/broken search in TFA to see where the bugs gather.
I was curious what information I could glean from these for some popular repos. Caveat: I'm primarily an low-level embedded developer so I don't interface with large open source projects at the source level very often (other than occasionally the linux kernel). I chose some projects at random that I use.
*Mainline linux*
Most changed files: pretty much what I expected for 1 and 2... the "cutting edge" of Linux development over other OSes -- bpf and containers. The bpf verifier and AMD GPU driver might get a boost in this list due to sheer LoCs in those files (26K and 14K respectively). An intel equivalent of amdgpu_dm is #21 in the list (drivers/gpu/drm/i915/display/intel_display.c) and nvidia is nowhere to be seen (presumably due to out-of-tree modules/blobs?).
10399 Christoph Hellwig -> I only know his name because of drama last year regarding rust bindings to DMA subsystem
8481 Mauro Carvalho Chehab -> I also know his name from the classic "Mauro, shut the fuck up!" Linus rant
8413 Takashi Iwai -> Listed as maintainer for sound subsystem, I think he manages ALSA
8072 Al Viro -> His name is all over bunch of filesystem code
Buggy files: Intel comes out on top of GPU drivers this time (twice). Along with KVM for x86(64), the main allocator, and BTRFS.
Buggy files: DWARF debuginfo generation, x86 heuristics tables, RS6000(?!) heuristic tables. I had to look up RS6000, it's an IBM instruction set from the 90s lol. cp-tree.h is an interesting file, it seems be the main C(++) AST datastructures.
*xfwm4*
Most changed files: the list is dominated by *.po localizations. I filtered these out. Even after this, I discovered there is very little active development in the last few years. If I extend to 4 years ago, I get:
1. src/client.c - Realizing this project is too "small" to glean much from this. client.c is just the core X client management code. Makes sense.
2. src/placement.c - Other core window management code.
This has not told me much other than where most of the functionality of this project lies.
Bus factor: Pretty huge. Not really an issue in this case due to lack of development I guess.
Files with bug commits: Very similar distribution to most changed files. Not enough datapoints in this one to draw any big conclusions.
I think these massive open projects (excl xfwm) are generally pretty consistent code quality across the heavily trodden areas because of the amount of manpower available to refactor the pain points. I've yet to see an example of "god help you if you have to change that file" in e.g. linux, but I have of course seen that situation many times in large proprietary codebases.
Big projects tend to self-correct. These commands hit differently on private codebases with 3-10 contributors, where high-churn usually means one person patching the same thing repeatedly.
Jujutsu equivalents, if anyone is curious:
What Changes the Most
Who Built This
Where Do Bugs Cluster
Is This Project Accelerating or Dying
How Often Is the Team Firefighting
Much more verbose, closer to programming than shell scripting. But less flags to remember.
To me, it makes jujutsu look like the Nix of VCSes.
Not meaning to offend anyone: Nix is cool, but adds complexity. And as a disclaimer: I used jujutsu for a few months and went back to git. Mostly because git is wired in my fingers, and git is everywhere. Those examples of what jujutsu can do and not git sound nice, but in those few months I never remotely had a need for them, so it felt overkill for me.
It's the dvorak of git... Maybe more efficient but incompatible with everyone else and a very loud vocal minority.
You can find this pattern again and again. How many redditors say 120fps is essential for gaming or absolutely require a mechanical keyboard?
3 replies →
Tbf you wouldn't use/switch to jj for (because of) those kind of commands, and are quite the outlier in the grand list of reasons to use jj. However the option to use the revset language in that manner is a high-ranking reason to use jj in my opinion.
The most frequent "complex" command I use is to find commits in my name that are unsigned, and then sign them (this is owing to my workflow with agents that commit on my behalf but I'm not going to give agents my private key!)
I hadn't even spared a moment to consider the git equivalent but I would humbly expect it to be quite obtuse.
5 replies →
No, jj is super simple in daily use, in contrast with git that is a constant chore (and any sane person use alias). This include stuff that in git is a total mess of complexity like dealing with rebases. So not judge the tool for this odd case.
1 reply →
I don’t understand how people can remember all these custom scripting languages. I can’t even remember most git flags, I’m ecstatic when I remember how to iterate over arrays in “jq”, I can’t fathom how people remember these types of syntaxes.
I am convinced that the vast majority of professionals simply don't bother to remember and, ESPECIALLY WITH GIT, just look stuff up every single time the workflow deviates from their daily usage.
At this point perhaps a million person-years have been sacrificed to the semantically incoherent shit UX of git. I have loathed git from the beginning but there's effectively no other choice.
That said, the OP's commands are useful, I am copying them (because obviously I won't ever memorize them).
10 replies →
I don't, I will google things and fiddle, then put it in a git alias (with a comment on what it does and / or where I got it from) and push it to my private dotfiles repo, taking it with me between computers and projects.
jj's template and revset languages are very simple syntactically, so once you're comfortable with the few things you do use often it's just a question of learning about the other existing functions (even if only enough to know to look them up), which slot right in and compose well with everything else you know (unlike flags which typically have each their own system).
Or, perhaps better yet, defining your own functions/helpers as you go for things you might care about, which, by virtue of having been named you, are much easier to remember (and still compose nicely).
You research it once, use it and then remember that it has "ancestor" in the command somewhere and then use ctrl + R to dig up something from your shell history.
So, how does one iterate over an array in jq? Asking for a friend.
> I don’t understand how people can remember all these custom scripting languages.
We can't.
Why do you think the `man` command exists?
Some things are idioms that one repeats so often they just stick, e.g. I use "grep.... | cut -c x-y | sort | uniq -c | sort -nr" to quickly grep frequency of some events from a log file.
Don't feel bad - no one remembers them all, we just remember a few idioms we use...
Same, but now with AI I don't have to remember that anymore
If I look something up twice, I record it in Obsidian. If I need it more than a couple of times, I'll probably make an alias, a script or a mask [1] file. Autocomplete and autosuggest are essential to my workflow. And good history search.
[1] https://github.com/jacobdeichert/mask
Nobody does. One person figures it out, then writes a blog post, and we all Google for it. Even git’s man pages are long and sometimes cryptic.
Yeah especially with git. All I know is pull, add, commit, push. Everything else I have to look up.
You add them into your GIT config file as shortcuts?
If you have multiple machines (/must have), just apply your user config to current machine?
If you don't have to codedive new projects all the time, there's zero reason to memorize these. If your job is to look at new codebases all the time, you probably learn to remember these commands pretty quickly.
a project isn’t dying because of no commits. Rather it’s stable
I often feel I need to setup bots to make superfluous commits just to make it look like my useful and stable repos are “active”
One example (not mine) a a qr-code generator library. Hasn’t been updated in 10 years. It’s perfect as is. It just provides the size and the bits. You convert those bits to any representation you want. It has no need to be updated
In a real company? A private codebase at a minimum should still be getting regular security patching and dependency updates. Always eventually one of those updates requires some level of refactor. If I see a project with no commits, I run away.
It's rare, I think, for a project to have such a well defined and singular purpose that has not changed in 10 years nor have any bugs been discovered or its dependencies changed underneath it.
It's not impossible, of course, but if I saw even a qr library that hadn't changed in 10 years I would worry that it wouldn't build on current systems (due to dependencies) and that nobody was actually using it (due to lag of bug reports).
3 replies →
This might be true for libraries or utilities that have a well-defined scope and no dependencies, but that's not what the article is focused on. When considering a company's main product, it's usually never done and patterns of activity—and especially changes in those patterns—can give you insight into potential issues.
> a project isn’t dying because of no commits. Rather it’s stable
Agreed. Assuming there are no open issues and PRs. When I find a project, if the date of the last commit is old, I next look at the issues and PRs. If there are simple-to-deal-with issues (e.g. a short question or spam) and easy-to-merge PRs (e.g. fixing a typo in the README) which have been left lingering for years, it’s probably abandoned. Looking at the maintainer’s GitHub activity graph should provide more clues.
> I often feel I need to setup bots to make superfluous commits just to make it look like my useful and stable repos are “active”
I have never done it, but a few times thought about making a “maintenance” release to bump the version number and release date, especially since I often use a variant of calendar versioning.
I don't want to program git, I want to get stuff done so I would reject using that tool and do what the article author did running tried and true pipeable Linux/UNIX commands. It's also the same reason why I dislike Gradle and use Maven, I don't want to program my build I want to define and run my build.
But the git commands in the article is also programming of the same kind, just using more terse, more obscure language. All the shell pipelines are sort, uniq, and grep.
A language that properly maps to the data model, and has readable identifiers is a boon. Git is a database, a database needs a proper query language.
I can't remember all of this, does anyone know of any LLM model trained on CLI which can be run locally?
If you copy those commands into a file and use that file to prompt the “sh” LLM.
1 reply →
Try https://cheat.sh/
Not a model, but a product: warp.dev
Hah someone really looked at jq (?) and thought: "yes, more of this everywhere". I feel jq is like marmite (edit: aka vegemite, i.e. "you either love it or you hate it")
It doesn't seem any more egregious than something like:
`git log --color --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --`
Which is something I see a lot of people alias in Git for viewing logs.
It's really not that bad, although the jq comparison might be apt. You have such primitives you need to understand, and then everything just fits together nicely. I find this much easier to write and understand than git's cryptic format strings.
Disclaimer: I love jq too :)
[dead]
I love how the author thinks developers write commit messages.
All joking aside, it really is a chronic problem in the corporate world. Most codebases I encounter just have "changed stuff" or "hope this works now".
It's a small minority of developers (myself included) who consider the git commit log to be important enough to spend time writing something meaningful.
AI generated commit messages helps this a lot, if developers would actually use it (I hope they will).
And in a squash and merge workflow, which are most teams I've been on the past 8 years, it really is the title of the pull request or merge request. That is what really matters.
And I really like that because it leaves room to let the developer do whatever kind of commit messages they want to that makes sense to them. Because nobody's really ever going to read those again after it squashed and merged.
- fix - fix fr - fix frfr - plz - omg why - never gonna give you up - never gonna let you down - add missing curly braces
This is a team lead/CTO problem. A good leader will be explicit in their expectations that developers write good commit messages. I've certainly had good leaders that expect this.
Yes, and a culture problem, too. I guess I've been blessed that I've mostly only worked for "grown up" companies, but I've never encountered a workplace where people didn't write useful commit messages. At least one line description of the work done, but often multiple lines of valuable context. Only the junior devs had to be told to do it, but once they got into the habit, everyone understood why we do it and it was no big deal.
If I joined a company where people committed their code with "stuff" or "made some changes" or "asdfhlfo;ejfo;ae," that would be a red flag that I might have joined the wrong company, and I'd start to wonder what else the developers here do carelessly.
4 replies →
I once tutored an intern. Who thought he was The Best Programmer On Earth (didn't we all at that age?). He refused to use revision control, it slowed him down.
So we told him to commit at least once every day, with a relevant commit message, or else fail his internship.
He worked 21 more days. There were 21 commits: "17:00, time to go home".
1 reply →
I think it's a stretch to measure leadership quality on something so minor, a lot of teams find them pretty useless no matter how good they are.
14 replies →
I particularly love when the “CTO” is also the main offender.
In codebases where PRs are squashed on merge, the commit messages on the main branch end up being the PR body description text, and that's actually reviewed so tends to be much better I find.
I've seen it be the concatenated individual git commit messages way too often. Just a full screen scroll of "my hands are writing letters" and "afkifrj". Still better than if we had those commits individually of course, but dear god.
The gold standard is rebased linear unsquashed history with literary commits, but I'll take merged and squashed PR commits with sensible commit messages without complaint.
And in every codebase I've been in charge of, each PR has one or more issue # linked which describe every possible antagonizing detail behind that work.
I understand this isn't inline with traditional git scm, but it's a very powerful workflow if you are OK with some hybridization.
2 replies →
If developers don't write commit messages, that's a culture problem. At my company we demand that of each other.
Only two of the five depend on commit messages. Churn, authorship, and velocity work regardless. Even teams with terrible hygiene write "fix" when something breaks.
As noted, authorship does not if commits are squashed, which seems to be common (I never do it).
> Even teams with terrible hygiene write "fix" when something breaks.
They might not include anything but the Jira ticket number, if the environment is truly lacking.
Our small team has a lot of commit messages like this. For a while, we had a guy on the team who had come from a site that expected more. The pet peeve he brought along was that commit messages end with a period (my guess is that someone at their previous work place had reasoned that forcing periods encouraged developers to actually write meaningful sentences). When I look at that period of development, I see lots of messages like “stuff changed.” And “more stuff changed.” And then it goes back to just “stuff changed” around the time they moved on.
> my guess is that someone at their previous work place had reasoned that forcing periods encouraged developers to actually write meaningful sentences
I have actually seen proper capitalization and correct conventional-commit types to correlate very well with the author being intentional and the patch being of good quality.
e.g.
- (a) chore: update some_module to include new_func
- (b) feat: Add new_func to handle XYZ case
Where:
(a) is not a chore, as it changes functionality, is uncapitalized and is so low-signal I can probably write a 10 line script to reliably generate similar titles.
(b) is using the correct "feat" commit type, capitalized and describe what this is for. I expect the body to explain "why", as well, and not to reiterate the "how" in natural language.
This is just my experience, but I've seen commit messages where people actually put in some effort to usually come with a good patch, and vice-versa.
I personally use git commit -m "." for: "Just snapshots this state real quick" on a feature branch.
main branch is advanced on PR level, with squashed commits.
So the "." should never make it to main, and have PR description as commit message.
> It's a small minority
Is it really a small minority? I have never worked on a project that didn't have commit messages that at least tried to be descriptive (sometimes people fail at it but its very different to an outright "changed stuff").
I don't remember any friend mentioning to me them encountering a work project where the messages were totally neglected either.
Never seen that in any company I worked at either and I can’t believe professional developers seem to think that it would be ok to write meaningless commit messages. That’s just so sloppy.
tbh I'm not convinced that a git log history should be treated as a group journal because it's not.
relying on git commit messages assumes they're correct by convention since there is no technical constraint to enforce it. and it assumes no work in progress commits, sometimes it's just necessary to hit the save button real quick or move a workspace from one device to another.
my point is: git is a way of storing and loading files at its core.
Bad commit messages always fail PR review for me. It requires will and discipline, but it's not that hard.
Sometimes it could be just a ticket number/title
Only two of the five insights are based on commit messages and the author acknowledges that they won't work in projects without message discipline. But the remaining ones will give you valuable insights even into the most lazy project department.
Commit messages are often squashed after merging a feature branch
> AI generated commit messages
git log --oneline and a sprinkle of your personal sauce on .claude goes a long way :)
I love how the commentator thinks a developer makes decisions based on commit messages.
Random, subjective, or written in a state of mental exhaustion commit messages.
I also love the switcheroo the author made: git not logs. But hey :)
It's because the vast majority of commit messages are never read by anyone, and there's other ways to fund out what happened in the handful of cases where you would need to.
I read commit messages all the time to figure out what a change was about.
For small personal projects I often write one phrase messages with `-m`, but if you're working with other people you should be writing good commit messages.
git blame's are amazing, and rely on good comments.
I ran these commands on a number of codebases I work on and I have to say they paint a very different picture than the reality I know to be true.
> git shortlog -sn --no-merges
Is the most egregious. In one codebase there is a developer's name at the top of the list who outpaced the number 2 by almost 3x the number of commits. That developer no longer works at the company? Crisis? Nope, the opposite. The developer was a net-negative to the team in more ways than one, didn't understand the codebase very well at all, and just happened to commit every time they turned around for some reason.
Yeah. I am the top committer at my current workplace, but I'd say that a majority of that gap is because my particular workflow results in many smaller commits than my coworkers.
Everything in context. This is one of many reasons I'm a proponent of squash-and-merge. If a change really needs more than one permanent commit, it should probably be split up or if absolutely necessary should be on a feature branch maintaining similar process. Under this process, feature branches are not squashed.
This leaves developers to commit locally and comment as much or little as they like.
Also - be careful of automated workflow that uses a single persons credential.. skews this by a lot.
So that person, on one central codebase at a company I work for, is me.
Assuming I'm not ego-mad, I like to think this is because I built the project from the ground up before handing it over to the rest of the team.
These days other people commit more often than I do, but my name is still dominant, and probably will be for some time.
I'm not saying more commits = bad developer. In my example that happened to be the case but not because they had a lot of commits but because they were bad at their job. I was just trying to warn that taking these git snippets at face-value does not paint the full picture.
If someone came to me and said "I ran these and I see XXX was the most prolific committer and they left X months ago, what will be do???" I'd have to work hard not to laugh.
Since these snippets are self-described as ways to get familiar with the code/projects I wanted to provide the counter point. Most of those snippets do not at all paint the real picture and for all the repos I tested it on they paint the opposite of reality.
I know these codebases like the back of my hand, the purported purpose of these snippets is to better understand the codebase, I can tell you they don't work for anything I tested them on. Maybe they work for other codebases but the sample size I have access to says they don't work for me.
1 reply →
> The 20 most-changed files in the last year. The file at the top is almost always the one people warn me about. “Oh yeah, that file. Everyone’s afraid to touch it.”
The most changed file is the one people are afraid of touching?
Just like that place that's so crowded nobody goes there anymore.
I've just tried this, and the most touched files are also the most irrelevant or boring files (auto generated, entry-point of the service etc.) in my tests.
Yeah same thing happens with lockfiles and CI configs. You end up filtering out half the list before it tells you anything useful.
I just tried it too and it basically just flagged a handful of 1500+ line files which probably ought to be broken up eventually but arent causing any serious problems.
1 reply →
Plotting Churn against Complexity is far more useful than merely churn.
It shows places that are problematic much better. High churn, low complexity: fine. Its recognized and optimizef that this is worked on a lot (e.g. some mapping file, a dsl, business rules etc). Low churn high complexity: fine too. Its a mess, but no-one has to be there. But both? Thats probably where most bugs originate, where PRs block, where test coverage is poor and where everyone knows time is needed to refactor.
In fact, quite often I found that a teams' call "to rewrite the app from scratch" was really about those few high-churn-high-complexity modules, files or classes.
Complexity is a deep topic, but even simple checks like how nested smt is, or how many statements can do.
This command needs a warning. Using this command and drawing too many conclusions from it, especially if you’re new, will make you look stupid in front of your team mates.
I ran this on the repo I have open and after I filtered out the non code files it really can only tell me which features we worked on in the last year. It says more about how we decided to split up the features into increments than anything to do with bugs and “churn”.
I found it interesting, that Git itself has built in similarity notion... when it packs objects, it groups files by path+size, runs delta cmpression to find which are close.
Very different from just counting commits - https://vectree.io/c/delta-compression-heuristics-and-packfi...
Good thing that the article contains that warning, then.
4 replies →
These commands are just about what files to start looking at to understand new codebase.
Better for people to know they're just blindly copying tools and parroting their output as if it's automatically meaningfully. Any warning against that should be built into the individual, for their own sake
1 reply →
Yes. Because the fear is butressed with necessity. You have to edit the file, and so does everyone else and that is a recipe for a lot of mess. I can think back over years of files like this. Usually kilolines of impossible to reason about doeverything.
Definitely not in my experience. The most changed are the change logs, files with version numbers and readmes. I don't think anyone is afraid of keeping those up to date.
pom.xml and package.json came up on couple of separate projects I ran the commands on. Which makes sense because the versions get bumped rather frequently. I guess context matters, as usual.
Yeah, the truth is going to be a lot more subtle than this.
The LLM that wrote the copy is an idiot.
This is such obvious LLM slop.
Could be also that a frequently edited file had most opportunity to be broken. And it was edited by the most random crowd.
In my case, it's .github/CODEOWNERS.
Nobody is afraid of changing it.
Why does github owners need frequent change? Do members in you team change so often?
I have a summary alias that kind of does similar things
EDIT: props to https://github.com/GitAlias/gitalias
Curious - why write it as a function in presumably .gitconfig and not just a git-summary script in your path? Just seems like a lot of extra escapes and quotes and stuff
It's a very old config that I copied from someone many years ago, agree that it's a bit hard to parse visually.
Not the poster, but one theory: so you only need to copy one file. Portability.
2 replies →
Looks nice. Unfortunately I don't have log-of-count-and-email, log-of-count-and-day or churn
+1 Same.
Edit. https://github.com/mattrighetti/dotfiles/blob/master/.gitcon...
1 reply →
You could make a local `man` page.
Some nice ideas but the regexes should include word boundaries. For example:
git log -i -E --grep="\b(fix|fixed|fixes|bug|broken)\b" --name-only --format='' | sort | uniq -c | sort -nr | head -20
I have a project with a large package named "debugger". The presence of "bug" within "debugger" causes the original command to go crazy.
Note this doesn't work by default on macOS, where git uses the POSIX version of grep (which doesn't support `\b`). You need to use the Perl Regexp option by switching -E with -P:
git log -i -P --grep="\b(fix|fixed|fixes|bug|broken)\b" --name-only --format='' | sort | uniq -c | gsort -nr | head -20
Good catch, that's better
> One caveat: squash-merge workflows compress authorship. If the team squashes every PR into a single commit, this output reflects who merged, not who wrote. Worth asking about the merge strategy before drawing conclusions.
In my experience, when the team doesn't squash, this will reflect the messiest members of the team.
The top committer on the repository I maintain has 8x more commits than the second one. They were fired before I joined and nobody even remembers what they did. Git itself says: not much, just changing the same few files over and over.
Of course if nobody is making a mess in their own commits, this is not an issue. But if they are, squash can be quite more truthful.
I wouldn't trust "commit counts." The quality and content of a "commit" can vary widely between developers. I have one guy on my team who commits only working code that has been thoroughly tested locally, another guy who commits one line changes that often don't work, only to be followed by fixes, and more fixes. His "commits" have about 1/100th of the value of the first guy.
The author does not look at counter values but rather at how the values changes. That reveals dynamics.
My comment still seems relevant? Do frequent commits to correct mistakes imply more "value" than infrequent, but well tested, commits, or what? I don't think it is a reliable signal.
> Is This Project Accelerating or Dying > > git log --format='%ad' --date=format:'%Y-%m' | sort | uniq -c
If the commit frequency goes down, does it really mean that the project is dying? Maybe it is just becoming stable?
For this command in particular, one can add a cheap bar chart with awk:
git log --format='%ad' --date=format:'%Y-%m' | sort | uniq -c | awk '{printf $2" "; for (i=1;i<=$1;i++){printf "-";} print ""; }'
This is a neat trick for a quick visual presentation, thanks!
Yeah, this one demonstrates a particularly pernicious view of software development. One where growth, no matter how artificial, is the only sign of success.
If you work with service oriented software, the projects that are "dying" may very well be the most successful if it's a key component. Even from a business perspective having to write less code can also be a sign of success.
I don't know why this was overlooked when the churn metric is right there.
Bad memories at my former big tech company.
Whenever we initiated a new (internal) SW project, it had to go through an audit. One of the items in the checklist for any dependency was "Must have releases in the last 2 years"
I think the rationale was the risk of security vulnerabilities not being addressed, but still ...
That was my question too. I have plenty of projects I've worked on where they rarely get touched anymore. They don't need new features and nothing is broken.
Is it fair to say they are being "worked on" if nothing is being done?
1 reply →
Technically you're correct that change frequency doesn't necessarily mean dead, but the number of projects that are receiving very few updates because they're 'done' is a fraction of a fraction of a percent compared to the number that are just plain dead. I'm certain you can use change frequency as a proxy and never be wrong.
> I'm certain you can use change frequency as a proxy and never be wrong.
I (largely) wrote a corporate application 8 years ago, with 2 others. There was one change 2 years ago from another dev.
Lots of programs are functionally done in a relatively short amount of time.
"Accelerating or Dying" sounds like private equity's lazy way to describe opportunity, not as a metric to describe software.
2 replies →
Projects become more stable with time? Since when?
Something something Red Queen's race
Or you hired someone who squashes or doesn't commit every single change.
Rather than using an LLM to write fluffy paragraphs explaining what each command does and what it tells them, the author should have shown their output (truncated if necessary)
I also feel this reads like an AI slop, but at least I learned 5 commands. Not too bad.
Interesting ideas, but some to me seem very overgeneralizef, e.g.:
> How Often Is the Team Firefighting
> git log --oneline --since="1 year ago" | grep -iE 'revert|hotfix|emergency|rollback
> Crisis patterns are easy to read. Either they’re there or they’re not.
I disagree with the last two quoted sentences, and also, they sound like an LLM.
This one was funny to me because sure, it was accurate for my particular codebase, but also anyone paying attention to the company Slack would already know how often fires happen.
Solid list. I'd add git log --all --oneline --graph pretty early on — gives you a quick sense of how active different branches are and whether this is a "one person commits everything" project or actually distributed. Helped me a ton on a job where I inheritied a monolith with like 4 years of history.
The git blame tip is underrated. People treat it like a gotcha tool but its maybe the fastest way to find the PR/ticket that explains a weird decision.
To me all of these are symptoms of the problem that I outlined in my recent blog post: https://news.ycombinator.com/item?id=47606192
and it touches in detail what exactly commit standards should be, and even how to automate this on CI level.
And then I also have idea/vision how to connect commits to actual product/technical/infra specs, and how to make it all granular and maintainable, and also IDE support.
I would love to see any feedback on my efforts. If you decide to go through my entire 3 posts I wrote, thank you
Dying or stabilizing?
Most good projects end up solving a problem permanently and if there is no salary to protect with bogus new features it is then to be considered final?
Instead of focusing on the top 20 files, you can map the entire codebase with data taken from git log using ArcheoloGit [1].
[1]: https://github.com/marmelab/ArcheoloGit
> One caveat: squash-merge workflows compress authorship. If the team squashes every PR into a single commit, this output reflects who merged, not who wrote. Worth asking about the merge strategy before drawing conclusions.
I abhor squash merging for this and a few other reasons. I literally have to go out of my way to re-check out a branch. Someone who wants to use my current branch cannot do so if I merge my changes a month later, because the squash rewrites history, and now git is very confused. I don't get the obsession with "cleaning up the history" as if we're all always constantly running out of storage over 2 more commits.
For me the benefit is that I can revert or cherry-pick things one entire PR at a time, and I don't have to care if the author implemented their PR with a bunch of small "work in progress" commits.
And GitHub at least sets the author of the squashed commit as the one who opened the PR, not the one who merged it.
I can definitely see where it wouldn't work well for other workflows but I've had it work well on several teams and it seems easier than trying to get everyone to clean up their commits into nice, clean, well-titled histories before putting up a PR.
> If the team squashes every PR into a single commit, this output reflects who merged, not who wrote.
Squash-merge workflows are stupid (you lose information without gaining anything in return as it was easily filterable at retrieval anyway) and only useful as a workaround for people not knowing how to use git, but git stores the author and committer names separately, so it doesn't matter who merged, but rather whether the squashed patchset consisted of commits with multiple authors (and even then you could store it with Co-authored-by trailers, but that's harder to use in such oneliners).
Can you explain to me (an avid squash-merger) what extra information do you gain by having commits that say "argh, let's see if this works", "crap, the CI is failing again, small fix to see if it works", "pushing before leaving for vacation" in the main git history?
With a squash merge one PR is one commit, simple, clean and easy to roll back or cherry-pick to another branch.
These commits reaching the reviewer are a sign of either not knowing how to use git or not respecting their time. You clean things up and split into logical chunks when you get ready to push into a shared place.
31 replies →
If someone uses git commits like the save function of their editor and doesn't write messages intended for reading by anyone else, it makes sense to want to hide them
For other cases, you lose the information about why things are this way. It's too verbose to //comment on every like with how it came to be this way but on (non-rare in total, but rare per line) occasion it's useful to see what the change was that made the line be like this, or even just who to potentially ask for help (when >1 person worked on a feature branch, which I'd say is common)
5 replies →
Trivial and not too silly example:
Part of new feature you had working in an intermediate commit, but broke somewhere along the way and is not working in your last commit when you squashed.
If you catch it early enough, I suppose it's in your reflog, but otherwise you're screwed.
It sounds like a silly example, but I bet most developers have run into this at some point.
With mercurial/jujutsu, you get the best of both worlds: The "argh, let's see if this works" commits are what I call "microcommits", and the squashed versions are the real/public commits. With jujutsu, you get both. Your log shows only the "real" commits (equivalent of squashing all the commits between that and the prior "real" commit). But if you want to drill down into the microcommits, the information is always there.
Let's acknowledge the reality. Many people use git not just for version control, but for backup ("Let me commit this so I don't lose it"). Let's ensure the VC tool supports both and doesn't force you to pick one over the other.
You gain the extra information by having reasonable commit messages rather than the ones you mentioned. To fix CI you force push.
Can you explain to me what an avid squash-merger puts into the commit message of the squashed commit composed of commits "argh, let's see if this works", "crap, the CI is failing again, small fix to see if it works", and "pushing before leaving for vacation" ?
2 replies →
> "argh, let's see if this works", "crap, the CI is failing again, small fix to see if it works", "pushing before leaving for vacation"
These are all bad commits IMHO. Aside from the CI one, I understand that message. I have commits like that on personal projects but for professional projects I'd be frustrated if people were committing messages like that.
Personally I'm a "one commit" type of guy, I don't like committing things in a broken state even on a side branch unless I have to (to share the code or test a CI). Occasionally I will make multiple commits at the very end to make review easier or once I have everything working but I want to try something different but I have a bunch of options of saving code that don't involving committing:
- Stash
- Shevle (IDEA)
- Backblaze
- Time Machine
- Local History (IDEA)
The idea of committing WIP before leaving for a vacation just feels so wrong to me.
I once worked for someone who wanted developers to commit code before the end of every day as a safety measure. His reasoning was in case the developer's computer died or similar. I found that silly at the time and still do now. That's what backups are for, I dislike when people use git as a backup like that in a professional setting.
Why are those commits ending in the PR? Just unprofessional to work like that.
git bisect gets more useful because it will pin a smaller set of changes
Squash merge is the only reasonable way to use GitHub:
If you update a PR with review feedback, you shouldn’t change existing commits because GitHub’s tools for showing you what has changed since your last review assume you are pushing new commits.
But then you don’t want those multiple commits addressing PR feedback to merge as they’re noise.
So sure, there’s workflows with Git that doesn’t need squashing. But they’re incompatible with GitHub, which is at least where I keep my code today.
Is it perfect? No. But neither is git, and I live in the world I am given.
Yes, I think people who are anti squash merge are those who don't work in Github and use a patch based system or something different. If you're sending a patch for linux, yes it makes sense that you want to send one complete, well described patch. But Github's tooling is based around the squash merge. It works well and I don't know anyone in real life who has issues with it.
And to counter some specific points:
* In a github PR, you write the main commit msg and description once per PR, then you tack on as many commits as you want, and everyone knows they're all just pieces of work towards the main goal of the eventually squashed commit
* Forcing a clean up every time you make a new commit is not only annoying extra work, but it also overwrites history that might be important for the review of that PR (but not important for what ends up in main branch).
* When follow up is requested, you can just tack on new commits, and reviewers can easily see what new code was added since their last review. If you had to force overwrite your whole commit chain for the PR, this becomes very annoying and not useful to reviewers.
* In the end, squash merge means you clean up things once, instead of potentially many times
1 reply →
If your goal here is to have linear history, then just use a merge commit when merging the PR to main and always use `git log --first-parent`. That will only show commits directly on main, and gives you a clean, linear history.
If you want to dig down into the subcommits from a merge, then you still can. This is useful if you are going back and bisecting to find a bug, as those individual commits may hold value.
You can also cherry pick or rollback the single merge commit, as it holds everything under it as a single unit.
This avoids changing history, and importantly, allows stacked PRs to exist cleanly.
1 reply →
The author is talking about the case where you have coherent commits, probably from multiple PRs/merges, that get merged into a main branch as a single commit.
Yeah, I can imagine it being annoying that sqashing in that case wipes the author attribution, when not everybody is doing PRs against the main branch.
However, calling all squash-merge workflows "stupid" without any nuance.. well that's "stupid" :)
I don't think there's much nuance in the "I don't know --first-parent exists" workflow. Yes, you may sometimes squash-merge a contribution coming from someone who can't use git well when you realize that it will just be simpler for everyone to do that than to demand them to clean their stuff up, but that's pretty much the only time you actually have a good reason to do that.
4 replies →
I think the point is that if you have to squash, the PR-maker was already gitting wrong. They should have "squashed" on their end to one or more smaller, logically coherent commits, and then submitted that result.
1 reply →
Squash-merge is entirely fine for small PRs. Cleaning up the commits in advance (probably to just squash them to one or two anyway) is extra work, and anything that discourages people from pushing often (to get the code off their local machine) needs to be well-justified. Just review the (smallish!) total outcome of all the commits and squash after review. A few well-placed messages on the commit, attached to relevant lines, are more helpful and less work than cleaning up the commit history of a smallish PR.
For really large PRs, I’m more inclined to agree with you, but those should probably have their own small-PR-and-squash-merge flow that naturally cleans up their git history, anyway.
I categorically disagree that squash-merge is “stupid” but agree there are many ways to skin this cat.
Calling squash stupid sounds like a case of Dunning-Kruger.
If you've worked on a large team without squashing and without increasing frustration I'd be greatly interested to hear about it.
How does not squash merging deal with the fact that branches disappear when merging? What I mean is that the information "this commit happened in the context of this PR or this overarching goal" goes missing. When you squash, you use the one central unit of information management in Git: the commit.
Having the tree easy to filter doesn't matter if it returns hundreds of commits you have to sift through for no reason.
Having the commit graph easy to filter means exactly that you don't have to sift through hundreds of commits for no reason. What else did you think it would mean?
These were interesting but I don't know if they'd work on most or any of the places I've worked. Most places and teams I've worked at have 2-3 small repos per project. Are most places working with monorepos these days?
I can't speak for most, but the past few places I consulted or worked at used monorepos.
Jesus I've seen what you've done for others and want that for myself.
1 reply →
These are actually fun to run. Just checked from work who makes most commits and found I have as many commits in past 2 years as 3 next people.
That probably isn’t a good sign
For "what changes the most", in my project it's package.json / lock (because of automatic dependency updates) and translation / localization files; I'd argue that's pretty normal and healthy.
For the "bus factor", there's one guy and then there's me, but I stopped being a primary contributor to this project nearly two years ago, lol.
This is good stuff. Why I never think of things like this is beyond me. Thanks
Biggest life changer for me has been:
git clone --depth 1 --branch $SomeReleaseTag $SomeRepoURL
If you only want to build something, it only downloads what you need to build it. I've probably saved a few terabytes at this point!
These are some helpful heuristics, thanks.
This list is also one of many arguments for maintaining good Git discipline.
Nice set of commands! I would suggest using --all flag with git log though - scans through all branches and not just the current one
Can't resist making it as a git command https://github.com/zdk/git-critique
I just finished¹ building an experimental tool that tries to figure out if a repo is slopware or not just by looking at it's git history (plus some GitHub activity data).
The takeaway from my experiment is that you can really tell a lot by how / when / what people commit, but conclusions are very hard to generalize.
For example, I've also stumbled upon the "merge vs squash" issue, where squashes compress and mostly hide big chunks of history, so drawing conclusions from a squashed commit is basically just wild guessing.
(The author of course has also flagged this. But I just wanted to add my voice: yeah, careful to generalize.)
¹ Nothing is ever finished.
Ages ago, google released an algorithm to identify hotspots in code by using commit messages. https://github.com/niedbalski/python-bugspots
Ages ago google wrote an algorithm to detect hotspots by using commit messages, https://github.com/niedbalski/python-bugspots
Trusting the messages to contain specific keywords seems optimistic. I don't think I used "emergency" or "hotfix" ever. "Revert" is some times automatically created by some tools (E.g. un-merging a PR).
For the stuff I've worked on, if you want to know about bugfixes and emergency releases, you'd go to Jira where those values are formalized as fields. Someone else in the comments here had a suggestion which just looks for the word "fix" which would definitely capture some bugfix releases, but is more likely to catch fixes that were done during development of a feature.
Out of curiosity, I ran the 5 command on my project's public git tree. The only informative one was #4 ("Is This Project Accelerating or Dying") - it showed cliffs when significant pieces of logic were decoupled and moved to other repos.
The last sentence of the article is "Here’s what the rest of the week looks like." and then it just stops. Am I missing something?
It’s phrased in a confusing way, seems to be a mistake. It becomes clearer when you click the link in the sentence before.
It is probably referring to the article linked in the "codebase audit" text
For more insights on Git, check out https://github.com/nolasoft/okgit
These commands are very useful, but adapting them to the codebase makes a huge difference.
For most, I added some filters and slightly changed the regex, and it showed the reality of the codebase (I already knew the reality, I just wanted to see if it matched, and it did).
Thanks. What a great Skill for my Claude
My team usually uses "Squash and merge" when we finish PRs, so I feel that would skew the results significantly as it hides 99% of the commit messages inside the long description of the single squashed merge commit.
> The 20 most-changed files in the last year. The file at the top is almost always the one people warn me about.
What a weird check and assumption.
I mean, surely most of the "20 most-changed files" will be README and docs, plus language-specific lock-files etc. ?
So if you're not accounting for those in your git/jj syntax you're going to end up with an awful lot of false-positive noise.
Why would you touch the README file hundreds of times a year?
You're right about package.json, pnpm-lock etc though, but those are easy to filter out if the project in question uses them.
> Why would you touch the README file hundreds of times a year?
You're right, perhaps I should have said CHANGELOG etc.
Although some projects e.g. bump version numbers in README or add extra one-liner examples ....
Some readme files include changelogs. But aside from that I think this can still net some useful information. I like to look at the most recently changed files in a repo as well.
Fair point. I skip lockfiles, changelogs, and generated code. The first application file on the list is the one that matters. Should have been explicit about that in the post.
It’s easy enough to filter those out with grep. It still is relatively meaningless. If the team incrementally adds things then it’s just going to show what additions were made. It isn’t churn at all.
I created a small TUI based on the article https://github.com/mikaoelitiana/git-audit
You beat me to it! I envisioned creating some aliases but you exceeded it by building a TUI. Good job Claude! LOL ;)
i’ll try to use the in an hook and test them with Claude. Thank you !
I put it into a gist :)
https://gist.github.com/aeimer/8edc0b25f3197c0986d3f2618f036...
Before, I ask AI "is this project maintained" done.
Ah yes, good old
I use it often.
Great tips, added to notes.txt for future use ..
Another one I do, is:
This way I can see right away which branches are 'ahead' of the pack, what 'the pack' looks like, and what is up and coming for future reference ... in fact I use the 'gss' alias to find out whats going on, regularly, i.e. "git fetch --all && gss" - doing this regularly, and even historically logging it to a file on login, helps see activity in the repo without too much digging. I just watch the hashes.
What's the subversion equivalents to these commands?
superficial. If I have to unfuck the backend 10 times a week in our API adapter, then these commands will show me constantly changing the API adapter, although it's the backend team constantly fixing their own bugs
Nice timing. I was just today needing some of the info that these commands surface. Serendipitous!
Nice! Will probably adopt this, seems to give a great overview!
So you value more rushed descriptions of changes than actual changes. Nice
No searching the codebase/commits for "fuck" and shit"? That will give you an idea what what was put in under stressful circumstances like a late night during a crunch.
This is a great list of commands to quickly understand a repository. Thank you for sharing.
I'm so used to magit, it seems kind of primitive to pipe git output around like this.
Anyway, I can glean a lot of this information in a few minutes scrolling through and filtering the log in magit, and it doesn't require memorizing a bunch of command line arguments.
thank you - these are useful
Just looking at how often a file changes without knowing how big the file is seems a bit silly. Surely it should be changes/line or something?
Sure, normalizing by size would be more precise. But this is a quick gut check to know which files to look at first, not a metric.
Step 6: grep the thread count on the squash-merge debate to determine if the team has unresolved interpersonal conflict.
blog posts are just comments that would have been torn apart if only posted on a forum, now masquerading as important universal edicts
This should be renamed to "Git commands that I run as a new hire to get metrics I'll forget on day 2".
> The 20 most-changed files in the last year. The file at the top is almost always the one people warn me about. “Oh yeah, that file. Everyone’s afraid to touch it.”
I've got my Emacs set up to display next to every file that is versioned the number of commits that file has been modified in (for the curious: using a modified all-the-icons-ivy-rich + custom elisp code + custom Bash scripts I wrote and it's trickier than it seems to do in a way that doesn't slows everything down). For example in the menu to open a file or open a recently visited file etc.: basically in every file list, in addition to its size, owner, permissions, etc. I also add the number of commits if it's a versioned file.
I like the fix/bug/broken search in TFA to see where the bugs gather.
I was curious what information I could glean from these for some popular repos. Caveat: I'm primarily an low-level embedded developer so I don't interface with large open source projects at the source level very often (other than occasionally the linux kernel). I chose some projects at random that I use.
*Mainline linux*
Most changed files: pretty much what I expected for 1 and 2... the "cutting edge" of Linux development over other OSes -- bpf and containers. The bpf verifier and AMD GPU driver might get a boost in this list due to sheer LoCs in those files (26K and 14K respectively). An intel equivalent of amdgpu_dm is #21 in the list (drivers/gpu/drm/i915/display/intel_display.c) and nvidia is nowhere to be seen (presumably due to out-of-tree modules/blobs?).
Bus factor: obviously none. The top 4
Buggy files: Intel comes out on top of GPU drivers this time (twice). Along with KVM for x86(64), the main allocator, and BTRFS.
*GCC*
Most changed files: IR autovectorization code, riscv heuristics tables, and C++ template handling (pt.c is "paramaterized types").
Buggy files: DWARF debuginfo generation, x86 heuristics tables, RS6000(?!) heuristic tables. I had to look up RS6000, it's an IBM instruction set from the 90s lol. cp-tree.h is an interesting file, it seems be the main C(++) AST datastructures.
*xfwm4* Most changed files: the list is dominated by *.po localizations. I filtered these out. Even after this, I discovered there is very little active development in the last few years. If I extend to 4 years ago, I get: 1. src/client.c - Realizing this project is too "small" to glean much from this. client.c is just the core X client management code. Makes sense. 2. src/placement.c - Other core window management code.
This has not told me much other than where most of the functionality of this project lies.
Bus factor: Pretty huge. Not really an issue in this case due to lack of development I guess.
Files with bug commits: Very similar distribution to most changed files. Not enough datapoints in this one to draw any big conclusions.
I think these massive open projects (excl xfwm) are generally pretty consistent code quality across the heavily trodden areas because of the amount of manpower available to refactor the pain points. I've yet to see an example of "god help you if you have to change that file" in e.g. linux, but I have of course seen that situation many times in large proprietary codebases.
This is better than the post itself. Showing real output from a real repo.
Big projects tend to self-correct. These commands hit differently on private codebases with 3-10 contributors, where high-churn usually means one person patching the same thing repeatedly.
More AI slop.
Wtf is happening to this website
Most often, what is happening is that people are groundlessly accusing others of writing AI slop.
[dead]
[dead]
[dead]
[dead]
[flagged]
git commands I run before reading any code:
git rm -rf .