Comment by tiagod

13 hours ago

I think even for single opening tags, as asked, there are impossible edge cases.

For example, this is perfectly valid XHTML:

    <a href="/" title="<a /> />"></a>

chungy 11 hours ago

No, that is not valid. The "<" and ">" characters in string values must always be escaped as &lt; and &gt;. The correct form would be:

    <a href="/" title="&lt;a /&gt; /&gt;"></a>
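For anyone generating markup like this, here is a minimal sketch of producing the escaped form with Python's standard `html.escape`. The variable names are made up for illustration and are not from the thread.

```python
import html

# The raw text we want to appear as the title attribute's value.
raw_title = "<a /> />"

# html.escape replaces &, < and >; quote=True also escapes " and '
# so the result is safe inside a quoted attribute.
escaped = html.escape(raw_title, quote=True)

tag = f'<a href="/" title="{escaped}"></a>'
print(tag)
```

The printed tag matches the escaped form given above.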

comex 13 hours ago

If you already know where the start of the opening tag is, then I think a regex is capable of finding the end of that same opening tag, even in cases like yours. In that sense, it’s possible to use a regex to parse a single tag. What’s not possible is finding opening tags within a larger fragment of HTML.
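A rough sketch of that idea in Python, under the assumption that attribute values are either double-quoted, single-quoted, or contain no ">" at all. The pattern and the helper name `end_of_opening_tag` are hypothetical, not something from the thread, and this is not a full tokenizer.

```python
import re

# Tag name, then any mix of quoted strings and other characters,
# up to the first '>' that is not inside quotes.
OPENING_TAG = re.compile(r"""<[A-Za-z][^\s/>]*(?:"[^"]*"|'[^']*'|[^>"'])*>""")

def end_of_opening_tag(text, start):
    """Return the index just past the opening tag that begins at `start`,
    or None if no opening tag matches there."""
    m = OPENING_TAG.match(text, start)
    return m.end() if m else None

sample = '<a href="/" title="<a /> />"></a>'
end = end_of_opening_tag(sample, 0)
print(sample[:end])
```

Because quoted strings are consumed as a unit, the ">" characters inside the `title` value do not end the match early.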

kstrauser 11 hours ago

For any given regex, an opponent can craft a string which is valid HTML but that the regex cannot parse. There are a million edge cases like:

    <!-- Don't count <hr> this! --> but do count <hr> this -->

and

    but do count <hr> this -->

Now your regex has to include balanced comment markers. Solve that.

You need a context-free grammar to correctly parse HTML with its quoting rules, escaping, embedded scripts, CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.

Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.
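One way to see where the extra power is needed: tracking how deeply a tag is nested takes a counter or stack that can grow without bound, which a classical regular expression does not have. Here is a minimal sketch using Python's standard `html.parser`. `DepthTracker` and `max_nesting` are names invented for this example, and the parser is lenient rather than a validator.

```python
from html.parser import HTMLParser

class DepthTracker(HTMLParser):
    """Track the deepest nesting seen for one tag name (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        # Ignore stray closing tags rather than going negative.
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

def max_nesting(markup, tag="div"):
    tracker = DepthTracker(tag)
    tracker.feed(markup)
    tracker.close()
    return tracker.max_depth

print(max_nesting("<div><div><div></div></div></div><div></div>"))
print(max_nesting("<!-- <div> --><div></div>"))
```

Tags inside comments are handed to the parser's comment handler, so they never reach the depth counter.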
- marcosdumay 9 hours ago
  
  HTML comments do not nest. The obvious tokenizer you can create with regular expressions is the correct one.
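A minimal sketch of that kind of tokenizer in Python, assuming comments end at the first "-->" and that `<hr>` attributes contain no ">". The pattern and the `count_hr` helper are illustrative only; unterminated comments and other corner cases are not handled.

```python
import re

# Either a whole comment (non-greedy, so it stops at the first "-->")
# or an <hr> tag. Comments are listed first so they win at the same position.
TOKEN = re.compile(r"<!--.*?-->|<hr\b[^>]*>", re.DOTALL | re.IGNORECASE)

def count_hr(markup):
    """Count <hr> tags that are not inside a comment."""
    return sum(
        1 for m in TOKEN.finditer(markup)
        if not m.group().startswith("<!--")
    )

sample = "<!-- Don't count <hr> this! --> but do count <hr> this -->"
print(count_hr(sample))
```

Since the scan consumes each comment as one token, an `<hr>` inside it is never seen as a tag, and the trailing "-->" is just text.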
  