Comment by robertlagrant

2 years ago

How so?


robertlagrant

The answer is amusing, but it seems the author either didn't read the question properly or didn't read their formal languages textbook properly, and rushed ahead with an answer that isn't really correct.

For one thing, it assumes that "regex" as used in programming is the same thing as "regular expressions" in the formal sense (the ones that define regular languages). More info on that in [1].

But the question isn't even about a full parse of HTML, with bracket balancing. It's just about syntactically matching all the opening tags, which is more "lexing" than "parsing". Instinctively that looks like a simple regular language to me, though I won't claim that with certainty. HTML is non-regular because of nested elements, but this user only cares about the tag syntax, which involves no nesting and no context-sensitivity.
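To make that concrete, here is a rough sketch of what such a tag-level pattern could look like in Python. The pattern and the name `opening_tags` are my own illustration, not anything from the original question, and it approximates tag syntax rather than following the HTML spec. It uses no backreferences or lookarounds, so it stays within what a formal regular expression can express.

```python
import re

# Illustrative pattern (an assumption, not the HTML spec):
# '<', a tag name, zero or more attributes, optional '/', then '>'.
OPEN_TAG = re.compile(
    r"""<([A-Za-z][A-Za-z0-9]*)          # tag name (captured)
        (?:\s+[^\s"'<>/=]+               # attribute name
           (?:\s*=\s*                    # optional '=' and value
              (?:"[^"]*"|'[^']*'|[^\s"'=<>`]+)
           )?
        )*
        \s*/?>                           # optional self-closing slash
    """,
    re.VERBOSE,
)

def opening_tags(html):
    """Return the names of all opening tags found in html."""
    return [m.group(1) for m in OPEN_TAG.finditer(html)]

print(opening_tags('<div class="a>b"><br/>text</div>'))
```

Closing tags are ignored because the character after `<` must be a letter, and a `>` inside a quoted attribute value does not end the tag early.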

One red herring is comments and CDATA sections. Since they cannot be nested, they do not change the language class: you just transition to a skip state when you see the start marker and back again when you see the end marker. They do make the expression much uglier, of course.
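A sketch of that skip-state idea, again in Python and again only an illustration (the pattern and `opening_tags_skipping` are hypothetical names of mine). Comments and CDATA sections are matched first and thrown away, so any tags inside them are never seen. The lazy `.*?` is just a convenience; "everything up to the first end marker" is still a regular language.

```python
import re

# Alternatives are tried left to right, so comments and CDATA are
# consumed whole before the opening-tag branch gets a chance.
TOKEN = re.compile(
    r"""<!--.*?-->                                  # comment: skipped
      | <!\[CDATA\[.*?\]\]>                         # CDATA section: skipped
      | <([A-Za-z][A-Za-z0-9]*)                     # opening tag name (captured)
         (?:"[^"]*"|'[^']*'|[^'">])*>               # rest of the tag, quotes respected
    """,
    re.VERBOSE | re.DOTALL,
)

def opening_tags_skipping(html):
    """Return opening tag names, ignoring comments and CDATA sections."""
    return [m.group(1) for m in TOKEN.finditer(html) if m.group(1)]

sample = '<p><!-- <b>hidden</b> --><![CDATA[<i>raw</i>]]><a href="x>y"></a>'
print(opening_tags_skipping(sample))
```

This does not handle an unterminated comment or CDATA section; in that case the scanner falls through and will pick up tags that follow the unmatched start marker.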

[1] https://en.wikipedia.org/wiki/Regular_expression#Patterns_fo...