Comment by ninjaoxygen

6 days ago

Ignoring CR is often how two systems end up parsing the same file differently: one sees two lines, the other a single line.

If the format is not sensitive to additional empty lines, then converting all CRs to LFs in place is likely a safer approach, or a tokenizer that coalesces every run of consecutive CR/LF characters into a single EOL token.
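A minimal sketch of that coalescing-tokenizer idea in Python (the function name and token shapes are my own, just for illustration): any run of CR and/or LF bytes, in any order, becomes exactly one EOL token, so both sides of the wire agree on where the line boundaries are.

```python
import re

def tokenize_lines(data: bytes):
    """Split raw bytes into TEXT and EOL tokens.

    A run of any mix of CR (0x0D) and LF (0x0A) bytes is coalesced
    into a single EOL token, so "\r\n", "\n\r", "\r\r\n" etc. all
    count as one line break.
    """
    tokens = []
    # Alternate between "anything that isn't CR/LF" and "a run of CR/LF".
    for match in re.finditer(rb"[^\r\n]+|[\r\n]+", data):
        chunk = match.group()
        if chunk[0] in b"\r\n":
            tokens.append(("EOL", b""))
        else:
            tokens.append(("TEXT", chunk))
    return tokens
```

Note this deliberately throws away empty lines, which is exactly why it is only safe when the format is not sensitive to them.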

I write a lot of software that parses control protocols, and the differences between the firmware from a single manufacturer across different devices are astonishing! I find it shocking how many actually have no delimiters or packet length at all.

Why would ignoring CR lead to problems? CR on its own has had nothing to do with line endings on any system released in the last quarter of a century.

If you’re targeting iMacs or the Commodore 64, then sure, it’s something to be mindful of. But I’d wager you’d have bigger compatibility problems before you even get to line endings.

Are there other edge cases regarding CR that I’ve missed? Or are you thinking ultra-defensively (from a security standpoint)?

That said, I do like your suggestion of treating CR like LF where the schema isn’t sensitive to line numbering. Unfortunately for my use case, line numbering does matter somewhat, so it would be good to understand whether I have a ticking time bomb.

The best option is to treat anything with an ASCII code below 0x20 (space) as (white)space, but unfortunately one doesn't get the chance to do that often enough.
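A quick sketch of that rule in Python (helper name is mine, purely illustrative): every byte below 0x20, which covers CR, LF, TAB, NUL and the rest of the C0 control range, is treated as a field separator, and consecutive separators are collapsed.

```python
def split_on_control(data: bytes):
    """Split bytes on runs of control characters (ASCII < 0x20).

    Space (0x20) and above are kept; CR, LF, TAB, NUL etc. all act
    as interchangeable whitespace, and runs of them collapse.
    """
    fields = []
    current = bytearray()
    for byte in data:
        if byte < 0x20:
            # End the current field, if any; runs of controls collapse.
            if current:
                fields.append(bytes(current))
                current = bytearray()
        else:
            current.append(byte)
    if current:
        fields.append(bytes(current))
    return fields
```

Of course, this erases the distinction between CR, LF and TAB entirely, which is precisely why it only works for formats where line numbering and field alignment don't matter.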