Comment by Blackthorn

8 years ago

> HTML isn't a CFG. The HTML spec is setup as a state machine ( = regular language) + a number of side data structures like the stack of open elements and list of active formatting elements.

That's...that's what a context-free grammar is.

(FWIW, wild-type html might not be context-free but require a higher powered parser.)

Operations on the stack of open elements don't have to follow the same LIFO discipline that a set of recursive productions (i.e. a CFG) would generate. For example, parts of the HTML5 parsing algorithm (eg. the Adoption Agency Algorithm) involve conditionally popping elements off the stack of open elements while they appear in the list of open formatting elements (which is itself a FIFO queue), and then pushing the popped elements back onto the list of open formatting elements. Other parts (eg. tables) involve re-parenting elements into parents that are currently on the stack of open elements.

(I've written a fairly well-used conforming HTML5 parser, so I do have some domain knowledge in this area...)

  • Fair enough...I'll cop to not knowing much about HTML5, my knowledge stopped with HTML4. I went and referenced the parsing rules and they are quite the mess. There's no official grammar for HTML5 that I can find and even if there was, I don't think there's a solid parser generator out there that handles anything more powerful than context-free grammars.