Comment by tetha

7 days ago

Having dealt with parser construction at university for a few months, I'd say the only real way to deal with this is fuzzing and round-trip tests.
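To make "fuzzing and round-trip tests" a bit more concrete, here is a rough sketch in Python. Everything in it is made up for illustration: `serialize`, `parse` and `fuzz_round_trip` are hypothetical names, and the "format" is just one item per line. The idea is to generate random inputs, write them out, read them back, and collect every case where you don't get the original back.

```python
import random


def serialize(items, newline="\n"):
    """Write one item per line, each terminated by `newline` (toy format)."""
    return "".join(item + newline for item in items)


def parse(text):
    """Toy parser: one item per line, accepting several line endings."""
    return text.splitlines()


def fuzz_round_trip(alphabet, trials=200, seed=0):
    """Return every generated input for which parse(serialize(x)) != x."""
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    failures = []
    for _ in range(trials):
        # 0 to 3 items, each 0 to 4 characters drawn from the alphabet
        items = [
            "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 4)))
            for _ in range(rng.randint(0, 3))
        ]
        if parse(serialize(items)) != items:
            failures.append(items)
    return failures
```

With a harmless alphabet such as `"abc "` the round trip should hold for every generated input. Once the alphabet contains `"\r"`, the same test starts reporting failures, because this toy parser treats a carriage return inside an item as a line ending. That is the kind of case one tends not to think of when writing tests by hand.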

It sounds defeatist, but non-trivial parsers end up with a huge state space very quickly, along with entirely strange error situations and problematic inputs. And "non-trivial" starts a lot sooner than one would assume. As the article shows, even "one element per line" ends up non-trivial once you support two platforms. "foo\r\n" could be tokenized/parsed in three or even four different ways.
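As a quick illustration of the "foo\r\n" point, here are a few plausible ways to split that input into lines using only Python's standard library. The `results` dictionary and its labels are just for this sketch; each strategy is something a reasonable implementer might pick.

```python
import re

text = "foo\r\n"

# Each entry is a plausible "split into lines" strategy and what it yields.
results = {
    "split on LF": text.split("\n"),                 # ['foo\r', '']
    "split on CRLF": text.split("\r\n"),             # ['foo', '']
    "splitlines": text.splitlines(),                 # ['foo']
    "splitlines, keep ends": text.splitlines(keepends=True),  # ['foo\r\n']
    "split on CR or LF": re.split(r"[\r\n]", text),  # ['foo', '', '']
}

for name, lines in results.items():
    print(f"{name:>22}: {lines!r}")
```

Five strategies, five different answers for a single four-character token plus a line ending: a stray carriage return, a trailing empty element, two trailing empty elements, or none at all.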

It just becomes worse from there. And then Unicode happened.