Comment by lindig

7 hours ago

> Instead of regular expressions, Janet’s text wrangling is based around parsing expression grammars. Parsing expression grammars are simpler, more powerful, and more predictable than regular expressions.

I would dispute that this is the case. In PEGs, alternatives are not commutative, unlike in regular expressions. This can lead to quite frustrating debugging. While a valid choice, the advantage over REs is overstated.

I'd argue they are not commutive in regexes either, at least as implemented in practice. Implemented regexes favor the leftmost alternative even when both sides of the alternative match. This matters in cases like: capturing groups, and backtracking implementations. There absolutely are cases where one ordering of alternatives could yield catastrophic backtracking for some input, while the other will avoid it completely.

I personally don't like this at all. This means that regex engines that try to generate optimized matching code for an expression can end up generating suboptimal code if you don't want alternative order to matter, since the engine needs to keep that invariant, except in the case when it can prove that the alternatives won't overlap, and a later one can be checked in constant time. If both are true, it is legal to reorder them to do the constant time check before the big complicated wildcard-filled alternative.

But personally, I have never written a regex where I actively cared about the alternative evaluation order. I've used some other people made where order is important but never written one myself.

I'd love to be able to tell the engine "feel free to swap the evaluation order of my alternatives while optimizing", but few if any such engines offer that as a feature.

Now I get that PEGs have commutivity problems are that are different from regexes', which make the issue worse, but that doesn't mean regexes do things right either.

The non-commutavity is a feature, not a bug. It allows you to have clearly defined parsing for grammars that would traditionally be considered ambiguous.

  • But that’s not how humans writing code generally think about ambiguous expressions. You can see that by how few precedence rules programmers tend to internalize, and often prefer extra parentheses to make sure that the parser interprets it the way they mean.

    • I am not talking about operator precedence, that’s a separate thing. Consider the parsing of math expressions, where juxtaposition of terms denotes multiplication unless it can be interpreted as something else. So f(x+2) is function application, whereas 3(x+2) is multiplication. With a PEG, you just write a standard expression grammar with an additional choice at the end for the implicit multiplication. With a conventional parser, this is much more difficult – you have to explicitly list all the different ways that terms could be next to each other without meaning something else.

PEGs are just soooo much easier to read than regexes for anything more complex than a few words or single line matching. REs are a hammer that tempts people to see everything as a nail, but once one progresses beyond that phase one usually wants as few REs as possible.

  • That’s mostly an artifact of concise regex syntax I believe. When you write regular expressions in an ABNF-like form, they become much more intelligible.

Came here for this comment. Janet would score positively in my mind if the evolutionary dead-end PEG were replaced with a grammar parser that is known to work under all circumstances.