Comment by nemetroid

4 years ago

The example is very confusing. According to the output of diffsitter, the diff is (in order):

1. "let x = 1" was removed

2. "fn add_one {" was removed

3. a closing brace was added

4. "fn addition() {" was added

5. "fn add_two() {" was added

(1) is neat, (2) is reasonable. (3) showing up before both (4) and (5) is super weird. Is there a bug, or does this just demonstrate a disconnect between how a parser parses the code and how a human parses it?

6 comments

nemetroid

comex 4 years ago

Indeed, the example seems a little underwhelming.

It does demonstrate that diffsitter can tell that `fn main() {` was not semantically changed at all, while still being able to cope with syntactically invalid code like `fn add_one {`.

But look at the following, where I've attempted to manually convert diffsitter's generated diff into a unified diff format, making it easier to understand what's going on:

    1   fn main() {
    2 -     let x = 1;
    3 - fn add_one {
    4 + }
    5 + fn addition() {
    6   }
    7 + fn add_two() {
    8   }

Look at the closing brace on line 6, which doesn't have a - or + in front of it, meaning that the diff tool has found a closing brace in the old code and one in the new code and decided that they correspond to each other. But this correspondence is not syntactically meaningful. In the old code, that closing brace belongs to `fn main`. In the new code, however, it belongs to `fn addition`, while `fn main`'s closing brace is now represented by the "inserted" closing brace on line 4!

Mixing up braces this way is a typical weakness of traditional diff tools. I'd hope that a syntax-aware diff tool could prevent such mix-ups, and instead generate a diff like:

    1   fn main() {
    2 -     let x = 1;
    3   }
    4 - fn add_one {
    5 - }
    6 + fn addition() {
    7 + }
    8 + fn add_two() {
    9 + }

Even though this diff is one line longer (while diff tools typically aim to produce the shortest possible diff), it's more semantically correct, and would be much easier to deal with if a merge conflict came into play.

Unfortunately, diffsitter does not do this, at least not in that demo. And I don't see how it would be [edit: caused by] a disconnect between how a parser parses the code and how a human parses it. The parser has the same idea of closing braces belonging to particular opening constructs as humans do.

feanaro 4 years ago
> And I don't see how it would be because of a disconnect between how a parser parses the code and how a human parses it. The parser has the same idea of closing braces belonging to particular opening constructs as humans do.
I don't understand the argument you made here. Why couldn't it do this?
Compared to the first snippet, in the second snippet:
1. There is still a function called `main`, but it has no lines, compared to a single line before. The conclusion is that this line was removed.
2. There is no longer an `add_one` function.
3. There are two new functions, `addition` and `add_two`.
This is exactly what your wanted diff is showing and all of those things can be determined by a parser.
- comex 4 years ago
  
  I agree. I think you misinterpreted my comment. By "I don't see how it would be because of a disconnect", I meant "I don't see how the issue could be caused by a disconnect". (Maybe you read it as "I don't see how diffsitter would do that, because there's a disconnect"?) The suggestion of a disconnect came from the parent comment, and I was attempting to refute it.
  
  1 reply →
marcodiego 4 years ago

Definitely underwhelming. Actually I couldn't see how it is better than "diff -u".

affyboi 4 years ago

Probably more just a function of my poor heuristics (or lack thereof), diffsitter does equality comparisons on tree sitter nodes, which can also cause weird stuff like this where text gets grouped together.

I'm redoing the diffing algo anyways, so I'll keep this in mind for the next go.

Edit: it's because I decided to split line-by-line a while back because I thought it would make the diffs easier to read (rather than squashing all of the tokens together, which is not very aesthetically pleasing), but also gives us unintuitive results like this.