Comment by ajross
6 days ago
Nitpick: the DSL here ("ini file format") is arguably ad-hoc, but it's extremely common and well-understood, and simple enough to make a common law specification work well enough in practice. The bug here wasn't due to the format. What you're actually complaining about is the hand-coded parser[1] sitting in the middle of a C program like a bomb waiting to go off. And, yes, that nonsense should have died decades ago.
There are places for clever hand code, even in C, even in the modern world. Data interchange is very not much not one of them. Just don't do this. If you want .ini, use toml. Use JSON if you don't. Even YAML is OK. Those with a penchant for pain like XML. And if you have convinced yourself your format must be binary (you're wrong, it doesn't), protobufs are there for you.
But absolutely, positively, never write a parser unless your job title is "programming language author". Use a library for this, even if you don't use libraries for anything else.
[1] Fine fine, lexer. We are nitpicking, after all.
Since we are nitpicking, git is considerably older than TOML, and nearly as old as YAML and JSON. In fact JSON wasn’t even standardised until after git’s first stable release.
Back then there wasn’t a whole lot of options besides rolling their own.
And when you also consider that git had to be somewhat rushed (due to Linux kernel developers having their licenses revoked for BitKeeper) and the fact that git was originally written and maintained by Linus himself, it’s a little more understandable that he wrote his own config file parser.
Under the same circumstances, I definitely would have done that too. In fact back then, I literally did write my own config file parsers.
The parser is fine, the bug is in the encoder which writes a value ending in \r without quotes even though \r is a special character at the end of a line.
How many hand crafted lexers dealing with lf vs. cr-lf encodings do exist? My guess is n > ( number of people who coded > 10 KSLOC ).
I’ve written a fair few lexers in my time. My general approach for CR is to simply ignore the character entirely.
If CR is used correctly in windows, then its behaviour is already covered by the LF case (as required for POSIX systems) and if CR is used incorrectly then you end up with all kinds of weird edge cases. So you’re much better off just jumping over that character entirely.
Ignoring CR is often how two systems end up parsing the same file differently, one as two lines the other as a single line.
If the format is not sensitive to additional empty lines then converting them all CR to LF in-place is likely a safer approach, or a tokenizer that coalesces all sequential CR/LF characters into a single EOL token.
I write a lot of software that parses control protocols, the differences between the firmware from a single manufacturer on different devices is astonishing! I find it shocking the number that actually have no delimiters or packet length.
2 replies →
Old Macs (pre-OS X I think) used CR only as line terminators.
1 reply →
I guess at this point it's okay to deprecate Mac OS 9 files?
1 reply →
More generally, any textual file format where whitespace is significant at the end of a line is calling for trouble.
5 replies →
If you're using protobuf you can still use text as it turns out [0]
[0] https://protobuf.dev/reference/protobuf/textformat-spec/