
Comment by 7bit

1 day ago

The very fact that UTF-8 itself discourages using the BOM is just so alien to me. I understand they want it to be the last encoding and therefore not in need of an explicit indicator, but as it currently IS NOT the only encoding in use, it makes it just so difficult to tell whether I'm reading one of the weird ASCII derivatives or actual Unicode.

It's maddening and it's frustrating. The US doesn't have any of these issues, but in Europe, that's a complete mess!

> The very fact that UTF-8 itself discourages using the BOM is just so alien to me.

Adding a BOM breaks compatibility with ASCII, and that compatibility is one of the benefits of using UTF-8.

> The very fact that UTF-8 itself discourages using the BOM is just so alien to me.

One of the key advantages of UTF-8 is that all ASCII content is effectively UTF-8. Having the BOM present reduces that convenience a bit, and a file starting with the three bytes 0xEF,0xBB,0xBF may be mistaken by some tools for a binary file rather than readable text.
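
A minimal Python sketch of both halves of that point, assuming only the standard codecs: pure ASCII bytes decode identically as ASCII and as UTF-8, while a BOM-prefixed file no longer starts with ASCII bytes. The sample text and the `is_ascii` helper are made up for illustration.

```python
import codecs

ascii_bytes = b"name,city\n"  # made-up sample content, pure ASCII

# Every ASCII byte sequence is also valid UTF-8 with the same meaning.
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# The UTF-8 BOM is U+FEFF encoded as UTF-8: the three bytes EF BB BF.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

with_bom = codecs.BOM_UTF8 + ascii_bytes


def is_ascii(data: bytes) -> bool:
    """Hypothetical helper: True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


print(is_ascii(ascii_bytes))  # True
print(is_ascii(with_bom))     # False: the BOM bytes are all >= 0x80

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character;
# "utf-8-sig" strips it when present.
print(repr(with_bom.decode("utf-8")[0]))                            # '\ufeff'
print(with_bom.decode("utf-8-sig") == ascii_bytes.decode("utf-8"))  # True
```

So a tool that only understands ASCII sees three unexpected high bytes at the start of the file, which is the incompatibility being described.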

> The US doesn't have any of these issues

I think you mean “the US chooses to completely ignore these issues and gets away with it because they defined the basic standard that is used, ASCII, way-back-when, and didn't foresee it becoming an international thing so didn't think about anyone else” :)

  • > because they defined the basic standard that is used, ASCII

    I thought it was EBCDIC /s

From Wikipedia...

    UTF-8 always has the same byte order,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8...
    Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII. For instance many programming languages permit non-ASCII bytes in string literals but not at the start of the file. ...
    A BOM is unnecessary for detecting UTF-8 encoding. UTF-8 is a sparse encoding: a large fraction of possible byte combinations do not result in valid UTF-8 text.

That last one is a weaker point, but it is true that with CSV a BOM is more likely to do harm than good.
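
Both claims are easy to poke at in Python. This is a rough sketch, not a robust detector: `looks_like_utf8` is a hypothetical helper name and the sample strings are made up. It treats data as UTF-8 if a strict decode succeeds, then shows the stray U+FEFF that a BOM leaves in the first CSV header when the reader does not strip it.

```python
import csv
import io


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic (hypothetical helper): does a strict UTF-8 decode succeed?"""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


text = "café,£5"  # made-up sample with non-ASCII characters
print(looks_like_utf8(text.encode("utf-8")))    # True
print(looks_like_utf8(text.encode("latin-1")))  # False: 0xE9 is not followed by a continuation byte

# A made-up CSV file saved with a UTF-8 BOM in front.
bom_csv = "\ufeffname,city\nZoë,Köln\n".encode("utf-8")

# Decoded as plain UTF-8, the BOM survives inside the first header name.
rows_plain = list(csv.reader(io.StringIO(bom_csv.decode("utf-8"))))
print(repr(rows_plain[0][0]))  # '\ufeffname'

# The "utf-8-sig" codec drops a leading BOM if one is present.
rows_sig = list(csv.reader(io.StringIO(bom_csv.decode("utf-8-sig"))))
print(repr(rows_sig[0][0]))    # 'name'
```

The decode check cannot tell pure ASCII from UTF-8, since ASCII is valid UTF-8, and short legacy-encoded inputs can occasionally pass by accident, which is presumably why it is the weaker point.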

Indeed, I've been using the BOM in all my text files for maybe decades now. Those who wrote the recommendation are clearly from an English country.

  • > are clearly from an English country

    One particular English-speaking country… The UK has issues with ASCII too, as our currency symbol (£) is not included. Not nearly as much trouble as non-English languages have, given the accents and such that they need, but we are still affected.
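
To make the £ point concrete, here is a small Python check, a sketch using only built-in codecs: the pound sign has no ASCII encoding, is the single byte 0xA3 in Latin-1, and becomes the two bytes 0xC2 0xA3 in UTF-8.

```python
pound = "£"  # U+00A3 POUND SIGN

# ASCII only covers code points 0 to 127, so encoding U+00A3 fails.
try:
    pound.encode("ascii")
    in_ascii = True
except UnicodeEncodeError:
    in_ascii = False

print(in_ascii)                       # False
print(pound.encode("latin-1").hex())  # a3
print(pound.encode("utf-8").hex())    # c2a3
```

The same character has different byte representations in different encodings, so a reader that guesses the wrong one shows garbage instead of £.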