Comment by bhl

4 years ago

Anyone know how to compute the diff between two ASTs if the edits _are_ known, either in general or with tree-sitter.

The method used by diffsitter is to compute the path with minimal editing distance: https://github.com/afnanenayet/diffsitter/blob/main/src/ast.....

In the case where the edits are known (because you're re-creating the AST per keystroke), you can save parsing time by passing in the old tree and a set of edits: https://tree-sitter.github.io/tree-sitter/using-parsers#edit....

Intuitively, I feel like there should be a way to reduce diffing time as well to avoid O(mn) run and space time that comes from the dynamic programming solution to compute the minimal edit.

5 comments

bhl

z3t4 4 years ago

I've made a incremental parser, and solved it by only re-parsing the changed code block. And the diff would then be the entire branch, but you could compare the old branch with the new branch and get a more fine detailed diff. It's hard to know beforehand what will get better performance, I found that for small files < 100 LOC it was faster to re-parse the entire file.

0b01 4 years ago

You are correct. It is indeed possible to use dynamic programming. For more info check out [0]. OP’s project probably takes too long on nontrivial merges in its current state.

[0] https://thume.ca/2017/06/17/tree-diffing/

bhl 4 years ago
That link does go over an optimized tree-diffing algorithm for ASTs, but the question I'm wondering is how to extend incremental parsing into incremental diffing, as opposed to full parse into full diff.
- the_benno 4 years ago
  
  The short answer is that it's unsolved in general and just done ad-hoc in practice. I'm working on incremental & demand-driven analysis techniques for my PhD research and currently building essentially what you describe. The state-of-the-art is probably this technique from this year's PLDI [1] -- last I heard they were using a batch parser but planning on integrating with tree-sitter at some point. [1] https://dl.acm.org/doi/10.1145/3453483.3454052
affyboi 4 years ago

Yes, it's not currently very performant for larger files (have a branch in the works to fix this). Checked out the link, it's an interesting read