Comment by pfdietz
4 days ago
The interesting problem here would be how do you produce a robust parse tree for invalid inputs, in the sense of stably parsing large sections of the text in ways that don't change too much. The tree would have to be an extension of an actual parse tree, with nodes indicating sections that couldn't be fully parsed or had errors. The diff algorithm would have to also be robust in the face of such error nodes.
For the parsing problem, maybe something like Early's algorithm that tries to minimize an error term?
You need this kind of robust parser for languages with preprocessors.
Unfortunately, this depends on making good decisions during language design; it's not something you can retrofit with a new lexer and parser.
One very important rule is: no token can span more than one (possibly backslash-extended) line. This means having neither delimited comments (use multiple single-line comments; if your editor is too dumb for this you really need a new editor) nor multi-line strings (but you can do implicit concatenation of a string literal flavor that implicitly includes the newline; as a side-effect this fixes the indentation problem).
If you don't follow this rule, you might as well give up on robustness, because how else are you going to ever resynchronize after an error?
For parsing you can generally just aggressively pop on mismatched parens, unexpected semicolons, or on keywords only allowed in a top-ish level context. Of course, if your language is insane (like C typedefs), you might not be able to parse the next top-level function/class anyway. GNU statement-expressions, by contrast, are an actually useful thing that requires some thought. But again, language design choices can mitigate this (such as making classes values, template argument equivalent to array indexing, and statements expressions).
Error recovery is a dead-end tech for all the reasons you say.
If people want to move forward they'll look past it. Garbage in, garbage out.
> how else are you going to ever resynchronize after an error?
An error-cost-minimizing dynamic programming parser could do this.
That fundamentally misunderstands the problem in multiple ways:
* this is still during lexing, not yet to parsing
* there are multiple valid token sequences that vary only with a single character at the start of the file. This is very common with Python multi-line strings in particular, since they are widely used as docstrings.
1 reply →
I think the easiest trick here is to stop thinking about it as a parsing problem and consider it only as a lexing problem. A good lexer either doesn't throw out errors or minimizes error token states, and a good lexer gets back to a regular stream of tokens as quickly as it can. This is why we trust "simple" lexers as our syntax highlighters in most IDEs, they are fast, and they handle unfinished and malformed documents just fine (we write those all the time in our processes in our editors).
My experience many years back with using just a syntax highlighting tuned lexer to build character-level diffs showed a lot of great promise: https://github.com/WorldMaker/tokdiff
Just use an AST if it parses, and fall back to plain text diffs if it doesn't.