Comment by tetha
7 days ago
And - having dealt with parser construction in university for a few months - the only real way to deal with this is fuzzing and round trip tests.
It sounds defeatist, but non-trivial parsers end up with a huge state space very quickly - and with entirely strange error situations and problematic inputs. And "non-trivial" starts a lot sooner than one would assume. As the article shows, even "one element per line" ends up non-trivial once you support two platforms. "foo\r\n" could be tokenized/parsed in three or even four different ways.
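To make the "foo\r\n" point concrete, here is a small Python sketch of four plausible ways to split that input into lines. The helper name `interpretations` and the dictionary keys are made up for illustration; the calls themselves are standard library.

```python
import re

def interpretations(text):
    # Hypothetical helper: four plausible "split into lines" strategies.
    return {
        # Treat only LF as the terminator; the CR stays in the content.
        "split_lf": text.split("\n"),
        # Treat only the CRLF pair as the terminator.
        "split_crlf": text.split("\r\n"),
        # Treat CR and LF as two independent terminators.
        "split_cr_or_lf": re.split(r"\r|\n", text),
        # Python's own notion of line boundaries; no trailing empty element.
        "splitlines": text.splitlines(),
    }

for name, result in interpretations("foo\r\n").items():
    print(f"{name:15} {result!r}")
```

All four answers are defensible, and none of them agree with each other, which is roughly where the state space starts growing.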
It just becomes worse from there. And then Unicode happened.
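A minimal sketch of what fuzzing plus a round-trip test could look like for the "one element per line" case, assuming a parser that keeps each line's terminator so that serializing the result must reproduce the input byte for byte. `parse_lines`, `serialize`, and `fuzz_round_trip` are hypothetical names for this example, not anything from the article.

```python
import random

def parse_lines(text):
    # Returns (content, terminator) pairs. The terminator is "\r\n", "\n",
    # "\r", or "" for a final line that has no terminator.
    lines = []
    start = 0
    i = 0
    while i < len(text):
        if text.startswith("\r\n", i):
            term = "\r\n"
        elif text[i] in "\r\n":
            term = text[i]
        else:
            i += 1
            continue
        lines.append((text[start:i], term))
        i += len(term)
        start = i
    if start < len(text):
        lines.append((text[start:], ""))
    return lines

def serialize(lines):
    # Inverse of parse_lines: glue content and terminators back together.
    return "".join(content + term for content, term in lines)

def fuzz_round_trip(iterations=10_000, seed=0):
    # A tiny alphabet heavy on CR and LF hits the odd combinations quickly.
    rng = random.Random(seed)
    alphabet = "ab\r\n"
    for _ in range(iterations):
        length = rng.randrange(0, 12)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        assert serialize(parse_lines(text)) == text, repr(text)

if __name__ == "__main__":
    fuzz_round_trip()
```

The round-trip property only says the parser loses nothing; it does not say the line boundaries are the ones you wanted, so it complements targeted tests rather than replacing them.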