Comment by 7bit

3 months ago

The very fact that UTF-8 itself discourages using the BOM is just so alien to me. I understand they want it to be the last encoding and therefore not in need of an explicit indicator, but as it currently IS NOT the only encoding in use, it makes it just so difficult to tell whether I'm reading one of the weird ASCII derivatives or actual Unicode.

It's maddening and it's frustrating. The US doesn't have any of these issues, but in Europe it's a complete mess!


dspillett 2 months ago

> The US doesn't have any of these issues

I think you mean “the US chooses to completely ignore these issues and gets away with it because they defined the basic standard that is used, ASCII, way-back-when, and didn't foresee it becoming an international thing so didn't think about anyone else” :)

hulitu 2 months ago

> because they defined the basic standard that is used, ASCII
I thought it was EBCDIC /s

capitainenemo 3 months ago

From Wikipedia...

    UTF-8 always has the same byte order,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8...
    Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII. For instance many programming languages permit non-ASCII bytes in string literals but not at the start of the file. ...
    A BOM is unnecessary for detecting UTF-8 encoding. UTF-8 is a sparse encoding: a large fraction of possible byte combinations do not result in valid UTF-8 text.

That last one is a weaker point, but it is true that with CSV a BOM is more likely to do harm than good.
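To make the "sparse encoding" point concrete, here is a minimal sketch of BOM-less detection by strict validation. The helper name `looks_like_utf8` is made up for illustration, and this is only a heuristic: pure ASCII input passes too, and a legacy 8-bit file can occasionally happen to be valid UTF-8 by accident.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "café".encode("utf-8")      # é becomes two bytes: 0xC3 0xA9
latin1_bytes = "café".encode("latin-1")  # é becomes one byte: 0xE9

print(looks_like_utf8(utf8_bytes))    # valid UTF-8
print(looks_like_utf8(latin1_bytes))  # 0xE9 starts a multi-byte sequence that never completes
print(looks_like_utf8(b"plain"))      # ASCII is also valid UTF-8, so this cannot tell them apart
```

This is why it is the weaker point: validation can rule UTF-8 out with certainty, but it can only say "probably" when ruling it in.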

dspillett 2 months ago

> The very fact that UTF-8 itself discourages using the BOM is just so alien to me.

One of the key advantages of UTF-8 is that all ASCII content is effectively UTF-8. Having the BOM present reduces that convenience a bit, and a file starting with the three bytes 0xEF,0xBB,0xBF may be mistaken by some tools for a binary file rather than readable text.
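A rough sketch of what that looks like in practice, using only the Python standard library. The sample CSV bytes and the helper names `decodes_as_ascii` and `first_csv_header` are invented for illustration; other tools may well behave differently.

```python
import codecs
import csv
import io


def decodes_as_ascii(data: bytes) -> bool:
    """Return True if every byte is 7-bit ASCII (hypothetical helper)."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


def first_csv_header(data: bytes, encoding: str) -> str:
    """Decode `data` with `encoding` and return the first header field."""
    reader = csv.reader(io.StringIO(data.decode(encoding), newline=""))
    return next(reader)[0]


plain = b"id,name\n1,Ann\n"          # made-up sample file, pure ASCII
with_bom = codecs.BOM_UTF8 + plain   # same file with EF BB BF in front

# Without a BOM, the ASCII and UTF-8 readings are identical.
print(plain.decode("ascii") == plain.decode("utf-8"))

# With a BOM, an ASCII-only reader chokes on the first byte.
print(decodes_as_ascii(plain), decodes_as_ascii(with_bom))

# Python's "utf-8" codec keeps the BOM as U+FEFF, so it leaks into the first
# header name; the "utf-8-sig" codec strips it.
print(first_csv_header(with_bom, "utf-8") == "id")
print(first_csv_header(with_bom, "utf-8-sig") == "id")
```

The leaked U+FEFF is invisible when printed, which is what makes a failed lookup of the "id" column so confusing to debug.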

7bit 2 months ago
Did you read past the first sentence I wrote?
ASCII does not work for any country other than the US, making it a shit encoding.

wtetzner 2 months ago

> The very fact that UTF-8 itself discourages using the BOM is just so alien to me.

Adding a BOM makes it incompatible with ASCII, which is one of the benefits of using UTF-8.

7bit 2 months ago
Another one who fails to read past my first sentence...
- wtetzner 2 months ago
  
  I read past your first sentence, but ASCII is used by non-English-speaking countries for many things. Source code, for one.

g-b-r 3 months ago

Indeed, I've been using the BOM in all my text files for maybe decades now; those who wrote the recommendation are clearly from an English country.

dspillett 2 months ago

> are clearly from an English country
One particular English-speaking country… The UK has issues with ASCII too, as our currency symbol (£) is not included. Not nearly as much trouble as non-English languages have due to the lack of the accents & such that they need, but we are still affected.