Comment by zomglings

4 years ago

My team has a similar project (Locust: https://github.com/bugout-dev/locust) where the goal is to learn the semantic meanings of code changes in git commits, GitHub PRs, etc.

Since we took git diffs as the target for semantic analysis, we take a different approach to computing our diffs. We start with line-by-line diffs (specifically using "git diff") and then build a semantic diff by superimposing the git diff information on top of the initial and terminal ASTs (the ASTs of the code before and after the change).

This makes the diff calculation cheaper because we don't have to do a full diff between trees.
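
To make the idea concrete, here is a rough sketch of overlaying line-level diff information on an AST. It is not Locust's actual code: the function names (changed_new_lines, touched_definitions) and the sample inputs are made up for illustration, it only looks at the "after" side of a single-file unified diff, and it uses Python's standard ast module (Python 3.8+ for end_lineno).

```python
import ast
import re

# Matches unified diff hunk headers such as "@@ -1,5 +1,6 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def changed_new_lines(diff_text):
    """Return the new-file line numbers that a single-file unified diff adds."""
    changed = set()
    new_lineno = None  # None until the first hunk header is seen
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            new_lineno = int(match.group(1))
        elif new_lineno is None:
            continue  # file headers ("---", "+++") come before any hunk
        elif line.startswith("+"):
            changed.add(new_lineno)
            new_lineno += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removed lines and "\ No newline" markers are not in the new file
        else:
            new_lineno += 1  # context line, present in both versions
    return changed


def touched_definitions(source, changed_lines):
    """List (kind, name, first_line, last_line) for definitions overlapping changed lines."""
    touched = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if any(node.lineno <= n <= node.end_lineno for n in changed_lines):
                touched.append((type(node).__name__, node.name, node.lineno, node.end_lineno))
    return sorted(touched, key=lambda item: item[2])


# Hypothetical inputs: the file after the change, and the diff that produced it.
NEW_SOURCE = "\n".join([
    "def greet(name):",
    '    message = "Hello, " + name',
    "    return message",
    "",
    "",
    "def farewell(name):",
    '    return "Bye, " + name',
    "",
])

DIFF = "\n".join([
    "--- a/example.py",
    "+++ b/example.py",
    "@@ -1,5 +1,6 @@",
    " def greet(name):",
    '-    return "Hello " + name',
    '+    message = "Hello, " + name',
    "+    return message",
    " ",
    " ",
    " def farewell(name):",
])

if __name__ == "__main__":
    print(touched_definitions(NEW_SOURCE, changed_new_lines(DIFF)))
```

The cost is one parse per file version plus a walk over the nodes, rather than a tree-to-tree comparison. A fuller version would do the same overlay on the "before" AST using the removed lines.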

Haven't updated the code in a few months, but my team is actively using Locust on public GitHub repos to learn the semantics of those code bases. We do plan to do some work on it soon to make Locust easier to use (especially as a library).

Really need to sit down and take a proper look at tree-sitter. We currently support Locust diffs for Python, JavaScript, and Java, but each one is custom written and implements the same basic algorithm. It looks like tree-sitter might just crush this problem for us.

2 comments

affyboi 4 years ago

I would recommend using tree-sitter. It's probably easier not to reinvent the wheel, especially when parsing something like C++. It's also really fast.

zomglings 4 years ago

Thanks, definitely planning to look at it. Would really love to implement the semantic metadata extraction once in C/Rust and then have support for a bunch of different languages.
It is especially exciting that they aim to provide useful information even in the presence of syntax errors.