Comment by lilyball

8 months ago

I can easily see this bug happening in Rust. At some level you need to transform your data model into text to write out, and to parse incoming text. If you want to parse linewise you might use BufRead::lines(), and then write a parser for those lines. That parser won't touch CRs at all, which means when you do the opposite and write the code that serializes your data model back to lines, it's easy to forget that you need to avoid having a trailing CR, since CR appears nowhere else in your code.
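A minimal sketch of the scenario described above, assuming a made-up "one value per line" format. The `serialize` and `parse` helpers are hypothetical names for illustration, not taken from any real project. The parser never mentions CR, yet a value ending in CR does not survive the round trip:

```rust
use std::io::BufRead;

/// Hypothetical serializer: writes one value per line, terminated by '\n'.
fn serialize(values: &[&str]) -> String {
    let mut out = String::new();
    for v in values {
        out.push_str(v);
        out.push('\n');
    }
    out
}

/// Hypothetical parser: reads one value per line using BufRead::lines().
/// `&[u8]` implements BufRead, so no file is needed for the demo.
fn parse(text: &str) -> Vec<String> {
    text.as_bytes()
        .lines()
        .map(|line| line.expect("input is valid UTF-8"))
        .collect()
}

fn main() {
    let original = ["plain", "trailing-cr\r"];
    let parsed = parse(&serialize(&original));

    // The first value round-trips unchanged.
    assert_eq!(parsed[0], "plain");
    // The second value was written as "trailing-cr\r\n"; lines() treats
    // "\r\n" as the line ending, so the CR that belonged to the value is gone.
    assert_eq!(parsed[1], "trailing-cr");
}
```

Nothing in either function looks wrong on its own, which is the point: the CR handling lives inside `lines()` rather than in the code you wrote.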

4 comments


tetha 8 months ago

And - having dealt with parser construction in university for a few months - the only real way to deal with this is fuzzing and round trip tests.
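As a rough illustration of a round trip test, here is a sketch that exhaustively tries every short string over a tiny alphabet instead of using a real fuzzer. The `serialize`/`parse` pair is a hypothetical line-based format and the alphabet is an arbitrary choice; a real setup would more likely use a fuzzing or property-testing tool:

```rust
use std::io::BufRead;

/// Hypothetical serializer: one value per line, terminated by '\n'.
fn serialize(values: &[String]) -> String {
    let mut out = String::new();
    for v in values {
        out.push_str(v);
        out.push('\n');
    }
    out
}

/// Hypothetical parser built on BufRead::lines().
fn parse(text: &str) -> Vec<String> {
    text.as_bytes()
        .lines()
        .map(|line| line.expect("input is valid UTF-8"))
        .collect()
}

/// All strings of length 0..=max_len over the given alphabet.
fn candidates(alphabet: &[char], max_len: usize) -> Vec<String> {
    let mut all = vec![String::new()];
    let mut frontier = vec![String::new()];
    for _ in 0..max_len {
        let mut next = Vec::new();
        for prefix in &frontier {
            for &c in alphabet {
                let mut s = prefix.clone();
                s.push(c);
                next.push(s);
            }
        }
        all.extend(next.iter().cloned());
        frontier = next;
    }
    all
}

/// Values that do not come back unchanged after serialize + parse.
fn round_trip_failures(max_len: usize) -> Vec<String> {
    candidates(&['a', ' ', '\r'], max_len)
        .into_iter()
        .filter(|v| parse(&serialize(std::slice::from_ref(v))) != vec![v.clone()])
        .collect()
}

fn main() {
    for v in round_trip_failures(2) {
        println!("round trip changed {:?}", v);
    }
}
```

With this alphabet, the values that fail are exactly the ones ending in CR, which is the kind of input nobody thinks to write a hand-made test for.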

It sounds defeatist, but non-trivial parsers end up with a huge state space very quickly - and entirely strange error situations and problematic inputs. And "non-trivial" starts a lot sooner than one would assume. As the article shows, even "one element per line" ends up non-trivial once you support two platforms. "foo\r\n" could be tokenized/parsed in 3 or even 4 different ways or so.
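To make the `"foo\r\n"` point concrete, here is a small sketch using only Rust standard library string methods. The `tokenizations` function is a hypothetical helper that collects four plausible ways of splitting the same input into lines:

```rust
/// Four standard-library ways to split the same text into "lines".
fn tokenizations(input: &str) -> [Vec<&str>; 4] {
    [
        // Split on LF: the CR stays attached, plus a trailing empty piece.
        input.split('\n').collect(),
        // Treat LF as a terminator: the CR stays attached, no empty piece.
        input.split_terminator('\n').collect(),
        // str::lines(): accepts both "\n" and "\r\n" as line endings.
        input.lines().collect(),
        // Split on CRLF only: the CR is consumed, plus a trailing empty piece.
        input.split("\r\n").collect(),
    ]
}

fn main() {
    let [by_lf, by_lf_terminator, by_lines, by_crlf] = tokenizations("foo\r\n");
    assert_eq!(by_lf, vec!["foo\r", ""]);
    assert_eq!(by_lf_terminator, vec!["foo\r"]);
    assert_eq!(by_lines, vec!["foo"]);
    assert_eq!(by_crlf, vec!["foo", ""]);
}
```

Each result is defensible on its own, so a parser and a serializer written at different times can easily disagree about which one is in effect.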

It just becomes worse from there. And then Unicode happened.

HappMacDonald 8 months ago

Well the question then becomes "how do you identify the quoting that needs to happen on the line" and tactics common in Rust enabled by features available in Rust will still lead a person away from this pattern of error.

One tool I'd have probably reached for (long before having heard of this particular corner case to avoid) would have been whitespace trimming, and CR counts as whitespace. Plus folk outside of C are also more likely to aim a regex at a line they want to parse, and anyone who's been writing regex for more than 5 minutes gets into the habit of adding `\s*` adjacent to beginning of line and end of line markers (and outside of capture groups) which in this case achieves the same end.
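For reference, a sketch of what whitespace trimming does to a trailing CR in Rust. `trim_sp_ht` is a hypothetical helper, and the assumption that a format treats only space and horizontal tab as ignorable is mine, added for contrast:

```rust
/// Hypothetical helper: trims only SP and HT from both ends.
fn trim_sp_ht(s: &str) -> &str {
    s.trim_matches(|c: char| c == ' ' || c == '\t')
}

fn main() {
    let raw = "  value\r";

    // str::trim() uses the Unicode White_Space property, so the CR is removed.
    assert_eq!(raw.trim(), "value");

    // Trimming only SP and HT leaves the CR as part of the value.
    assert_eq!(trim_sp_ht(raw), "value\r");

    // trim() also removes characters such as U+00A0 (no-break space).
    assert_eq!("value\u{00A0}".trim(), "value");
    assert_eq!(trim_sp_ht("value\u{00A0}"), "value\u{00A0}");
}
```

Whether the first or the second behavior is the right one depends entirely on how the format being parsed defines whitespace.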

lilyball 8 months ago

You're describing a different format entirely then if you're doing generic whitespace trimming without any consideration for the definition of "whitespace". The Git config format explicitly defines ignorable whitespace as spaces and horizontal tabs, and says that these whitespace characters are trimmed from values, which means nothing else gets trimmed from values. If you try to write a parser for this using a regular expression and `\s*` then you'd better look up what `\s` means to your regex engine because it almost certainly includes more than just SP and HT.

I can't think of any features in Rust that will lead someone away from this pattern of error, where this pattern of error is not realizing that round-tripping the serialized output back through the deserializer can change the boundaries of line endings. It's really easy to think "if I have a bunch of single-line strings and I join them with newlines I now have multiline text, and I can split that back up into individual lines and get back what I started with". This is doubly true if you start with a parser that splits on newline characters and then change it after the fact to use BufRead::lines() in response to someone telling you it doesn't work on Windows.

wizzwizz4 8 months ago

I've been writing regular expressions for at least 8 years, and I'm not sure I've ever written `\s*`.