← Back to context

Comment by perching_aix

13 hours ago

It took me years to notice, but did you catch that the answer actually subtly misinterprets what the question is asking for?

Guy (in my reading) appears to talk about matching an entire HTML document with regex. Indeed, that is not possible due to the grammars involved. But that is not what was being asked.

What was being asked is whether the individual HTML tags can be parsed via regex. And to my understanding those are very much workable, and there's no grammar capability mismatch either.

The thing is, even when parsing html "correctly" (whatever that is) regexes will still be used. Sure, There will be a bunch of additional structures and mechanisms involved, but you will be identifying tokens via a bunch of regexes.

So yes, while it is an inspired comidic genius of a rant, and sort of informative in that it opens your eyes to the limitations of regexes, it sort of brushes under the rug all the places that those poor maligned regular expressions will be used when parsing html.

I think even for single opening tags like asked there are impossible edge cases.

For example, this is perfectly valid XHTML:

    <a href="/" title="<a /> />"></a>

  • No, that is not valid. The "<" and ">" characters in string values must always be escaped with &lt; and &gt;. The correct form would be:

        <a href="/" title="&lt;a /&gt; /&gt;"></a>

  • If you already know where the start of the opening tag is, then I think a regex is capable of finding the end of that same opening tag, even in cases like yours. In that sense, it’s possible to use a regex to parse a single tag. What’s not possible is finding opening tags within a larger fragment of HTML.

    • For any given regex, an opponent can craft a string which is valid HTML but that the regex cannot parse. There are a million edge cases like:

        <!—- Don't count <hr> this! -—> but do count <hr> this -->
      

      and

        <!-- <!-- Ignore <ht> this --> but do count <hr> this —->
      

      Now your regex has to include balanced comment markers. Solve that

      You need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.

      Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.

      3 replies →