
Comment by Blackthorn

8 years ago

This is probably gonna get buried at this point, but one thing I'm surprised about is this seems like yet another parser bug. Why are we still using hand-written parsers? Even if you're Very Smart, you'll probably get it wrong. We have parser generators for a lot of things. Even for mostly unparseable garbage like wild-type HTML we have pretty good libraries for handling it. Fresh hand-written parsers are just bombs waiting to explode.

Your comment doesn't apply to this particular case, because the submission goes into great detail about how the parser in question was written with Ragel, a parser generator. Their Ragel code contained a bug that lay dormant and uncaught for years, and it manifested only when the calling/wrapping code was altered.

  • It still seems like a gross mismatch of power though. Correct me if I'm wrong, but Ragel can only output parsers for regular languages, yes? You can't call their Ragel code an HTML parser, because Ragel can't output a parser powerful enough to parse HTML.

    • HTML isn't a CFG. The HTML spec is set up as a state machine (= regular language) plus a number of side data structures, like the stack of open elements and the list of active formatting elements. This maps very easily to Ragel, where your actions can have side effects and reference internal state within the language.

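To make the "state machine plus side data structures" point concrete, here is a minimal sketch in Python rather than Ragel. It is a toy, not the HTML spec's algorithm: the state names, the `check_nesting` function, and the strict rejection of misnested tags are all assumptions made for illustration (real HTML parsing recovers from misnesting instead of failing). The `state` variable alone is a finite automaton; the `open_elements` stack is side data that the transition actions push to and pop from.

```python
def check_nesting(text):
    """Return True if the simple tags in `text` are properly nested.

    Toy model: `state` is a plain finite-state machine, and the
    `open_elements` stack is side data mutated by transition actions.
    """
    DATA, TAG_OPEN, START_NAME, END_NAME = range(4)
    state = DATA
    name = []            # characters of the tag name being read
    open_elements = []   # side data: stack of currently open tags

    for ch in text:
        if state == DATA:
            if ch == "<":
                state = TAG_OPEN
        elif state == TAG_OPEN:
            if ch == "/":
                name = []
                state = END_NAME
            else:
                name = [ch]
                state = START_NAME
        elif state == START_NAME:
            if ch == ">":
                open_elements.append("".join(name))  # action: push
                state = DATA
            else:
                name.append(ch)
        elif state == END_NAME:
            if ch == ">":
                # action: pop and compare with the closing tag's name
                if not open_elements or open_elements.pop() != "".join(name):
                    return False
                state = DATA
            else:
                name.append(ch)

    # Input must end outside a tag with nothing left open.
    return state == DATA and not open_elements
```

The four states never grow with the input; only the stack does, which is what lets a regular-language tool handle nested structure once its actions are allowed to keep state.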

I would also expect regularly performed fuzz testing to catch a bug like this, especially if run in combination with dynamic memory analysis such as Valgrind. So were they not doing this?
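For anyone unfamiliar with the idea, a rough sketch of fuzzing in Python follows. Everything here is hypothetical: `parse_attr` is a made-up toy parser with a planted out-of-bounds read, not the code from the submission, and `fuzz` is a naive random-input loop rather than a real fuzzer. In C the overread would be silent, which is where a tool like Valgrind comes in; in Python the bounds check raises `IndexError`, which stands in for that detection.

```python
import random


def parse_attr(buf):
    """Hypothetical toy parser: split 'name=v' into (name, first value char)."""
    i = 0
    while i < len(buf) and buf[i] != "=":
        i += 1
    # Planted bug: assumes a value character always follows '='.
    return buf[:i], buf[i + 1]


def fuzz(target, runs=2000, seed=0):
    """Feed random short strings to `target`; return the first crashing input."""
    rng = random.Random(seed)
    alphabet = "ab= "
    for _ in range(runs):
        length = rng.randint(0, 8)
        data = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            target(data)
        except IndexError:
            return data  # out-of-bounds read detected
    return None


crash = fuzz(parse_attr)
print("crashing input:", repr(crash))
```

Note that the well-formed input `"a=b"` parses fine, so a test suite built only from valid examples would never hit the bug; random malformed input finds it almost immediately.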