Comment by Blackthorn

8 years ago

It still seems like a gross mismatch of power though. Correct me if I'm wrong but Ragel only can output parsers for regular languages, yes? You can't call their Ragel code an HTML parser because Ragel can't output a parser powerful enough to parse HTML.

HTML isn't a CFG. The HTML spec is setup as a state machine ( = regular language) + a number of side data structures like the stack of open elements and list of active formatting elements. This maps very easily to Ragel, where your actions can easily have side-effects and reference internal state within the language.

  • > HTML isn't a CFG. The HTML spec is setup as a state machine ( = regular language) + a number of side data structures like the stack of open elements and list of active formatting elements.

    That's...that's what a context-free grammar is.

    (FWIW, wild-type html might not be context-free but require a higher powered parser.)

    • Operations on the stack of open elements don't have to follow the same LIFO discipline that a set of recursive productions (i.e. a CFG) would generate. For example, parts of the HTML5 parsing algorithm (eg. the Adoption Agency Algorithm) involve conditionally popping elements off the stack of open elements while they appear in the list of open formatting elements (which is itself a FIFO queue), and then pushing the popped elements back onto the list of open formatting elements. Other parts (eg. tables) involve re-parenting elements into parents that are currently on the stack of open elements.

      (I've written a fairly well-used conforming HTML5 parser, so I do have some domain knowledge in this area...)

      1 reply →