Comment by kccqzy

2 months ago

How about another approach: memoization. At each position in the source code we never attempt to parse the same production more than once. This solves the infinite looping the author discusses, because a repeated attempt at the same production and position is answered from the memo table instead of being run again, so the “loop” executes only once. Of course, I wouldn't literally use a while loop in code to represent the production. I would use a higher-level abstraction to indicate one-or-more or zero-or-more in the production; indeed, I would represent productions as data, not code.

This has the added benefit of work sharing. With a production like `A B | C B`, if parsing A and parsing C consume the same number of characters, the work to parse B is shared, even though the production was never literally factored into `(A | C) B`.
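
To make the idea concrete, here is a minimal sketch of a memoized parser over a grammar stored as data. Everything in it is made up for illustration (the `GRAMMAR` table, `make_parser`, the convention that a symbol is a production name if it is a key in the table and literal text otherwise), and it assumes PEG-style ordered choice. It does not handle left recursion.

```python
from collections import Counter

# Hypothetical grammar as data: production name -> list of alternatives,
# each alternative being a sequence of symbols. A symbol that is a key of
# the table names a production; any other symbol is literal text.
GRAMMAR = {
    "S": [["A", "B"], ["C", "B"]],   # S <- A B | C B
    "A": [["x", "y"]],
    "C": [["x", "y"]],               # consumes the same characters as A
    "B": [["z"]],
}

def make_parser(grammar, text):
    memo = {}          # (production, position) -> end position, or None on failure
    stats = Counter()  # records how often each key was computed vs. served from memo

    def parse(name, pos):
        key = (name, pos)
        if key in memo:
            stats[("hit", name, pos)] += 1
            return memo[key]
        stats[("computed", name, pos)] += 1
        result = None
        for alternative in grammar[name]:      # ordered choice
            end = pos
            for symbol in alternative:
                if symbol in grammar:
                    end = parse(symbol, end)   # nested production
                elif text.startswith(symbol, end):
                    end += len(symbol)         # literal matched
                else:
                    end = None                 # literal did not match
                if end is None:
                    break
            if end is not None:
                result = end
                break
        memo[key] = result
        return result

    return parse, stats
```

On the input `"xyw"`, A matches two characters, B fails at position 2, then C matches the same two characters and the second attempt at B is answered from the memo table: `stats[("computed", "B", 2)]` and `stats[("hit", "B", 2)]` are both 1.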

4 comments


luizfelberti 2 months ago

I also find this to be an elegant way of doing this, and it is also how the Thompson VM style of regex engine works. [0]

It's a bit harder to adapt the technique to parsers because the Thompson NFA always increments the sequence pointer by the same amount, while a parser's production usually has a variable size, making it harder to run several parsing heads in lockstep.

[0] https://swtch.com/~rsc/regexp/regexp2.html

Porygon 2 months ago

Memoization to limit left recursion is nicely described in Guido van Rossum's article here: https://medium.com/@gvanrossum_83706/left-recursive-peg-gram...

I recently tried that approach while simultaneously building an abstract syntax tree, but I dropped it in favor of a right-recursive grammar for now, since restoring the AST when backtracking got a bit complex.
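
For readers who have not seen the trick, here is a minimal sketch of the general "seed and grow" way of using a memo table to bound left recursion. It is not code from the linked article; the grammar (`expr <- expr '-' term / term`, `term <- [0-9]`) and all function names are invented for illustration, and it computes a value rather than building an AST.

```python
def parse_term(text, pos):
    """term <- [0-9]; returns (value, end) or None."""
    if pos < len(text) and text[pos].isdigit():
        return int(text[pos]), pos + 1
    return None

def make_expr_parser(text):
    memo = {}  # position -> (value, end), or None for failure

    def expr(pos):
        # expr <- expr '-' term / term
        if pos in memo:
            return memo[pos]
        # Seed the memo with failure so the left-recursive call bottoms out.
        memo[pos] = None
        while True:
            result = expr_body(pos)
            best = memo[pos]
            # Stop once an attempt fails or no longer consumes more input.
            if result is None or (best is not None and result[1] <= best[1]):
                break
            memo[pos] = result  # the seed grew; try again with the longer match
        return memo[pos]

    def expr_body(pos):
        left = expr(pos)  # left-recursive call, answered from the memo table
        if left is not None:
            value, end = left
            if end < len(text) and text[end] == "-":
                right = parse_term(text, end + 1)
                if right is not None:
                    return value - right[0], right[1]
        return parse_term(text, pos)  # second alternative: a bare term

    return expr
```

With this sketch, `make_expr_parser("7-2-1")(0)` returns `(4, 5)`, i.e. the input is grouped left-associatively as `(7-2)-1`. Each pass of the loop replaces the memo entry with a longer match, which is the state that has to be restored or rebuilt if an AST is constructed at the same time.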

kccqzy 2 months ago

You can look at the Earley parser. It handles left recursion well using a method that’s basically memoization.
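
A rough sketch of an Earley recognizer, to show where the memoization-like step lives: each item is added to a position's set at most once, so predicting a left-recursive production again is a no-op. The grammar encoding and the name `earley_recognize` are assumptions for this example, and the sketch assumes the grammar has no empty productions (those need extra handling).

```python
def earley_recognize(grammar, start, tokens):
    """Return True if tokens can be derived from start.

    grammar maps a production name to a list of alternatives, each a tuple
    of symbols; a symbol that is not a key of grammar is a terminal token.
    An item is (head, body, dot, origin).
    """
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in grammar[start]:
        chart[0].add((start, body, 0, 0))

    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            head, body, dot, origin = worklist.pop()
            if dot < len(body):
                symbol = body[dot]
                if symbol in grammar:
                    # Predict: items already in chart[i] are skipped, which
                    # is what keeps left recursion from looping forever.
                    for alternative in grammar[symbol]:
                        item = (symbol, alternative, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
                elif i < len(tokens) and tokens[i] == symbol:
                    # Scan: the terminal matches, advance into the next set.
                    chart[i + 1].add((head, body, dot + 1, origin))
            else:
                # Complete: advance every item that was waiting for `head`.
                for h2, b2, d2, o2 in list(chart[origin]):
                    if d2 < len(b2) and b2[d2] == head:
                        item = (h2, b2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)

    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])

# Hypothetical left-recursive grammar: E <- E '+' T | T, T <- 'n'
LEFT_RECURSIVE = {"E": [("E", "+", "T"), ("T",)], "T": [("n",)]}
```

For example, `earley_recognize(LEFT_RECURSIVE, "E", list("n+n+n"))` returns `True`, while the truncated input `list("n+")` is rejected.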

smj-edison 2 months ago

That's a slick approach. Would you essentially keep a second counter that you'd set to the current cursor whenever you call `.currentToken()` or something like that?