Comment by dspillett

1 day ago

> nailing down Unicode and text encodings was still considered rocket science. Now this is a solved problem

I wish…

Detecting text encoding is only easy if all you need to contend with is UTF16-with-BOM, UTF8-with-BOM, UTF8-without-BOM, and plain ASCII (which is effectively also UTF8). As soon as you might see UTF16 or UCS-2 without a BOM, or 8-bit codepages other than plain ASCII (many apps/libs assume these are always CP1252, a superset of the printable characters of ISO-8859-1, which may not be the case), things stop being deterministic.
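
To make the “easy” part concrete, a minimal Python sketch (illustrative only; the byte values are the standard BOMs):

    # The deterministic part of encoding detection is just a BOM sniff;
    # everything after the final return is where the guessing starts.
    def sniff_bom(data: bytes):
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8-sig"
        if data.startswith(b"\xff\xfe\x00\x00"):
            return "utf-32-le"  # must be tested before the UTF16-LE prefix
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le"
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be"
        return None  # no BOM: UTF8? UTF16? CP1252? Pure heuristics from here.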

Thankfully UTF8 has largely won out over the many 8-bit encodings, but that leaves the interesting case of UTF8-with-BOM. The standard recommends against using it, saying plain UTF8 is the way to go, but to get Excel to correctly load a UTF8 encoded CSV or similar you must include the BOM (otherwise it assumes CP1252 and mangles characters above 127). But… some apps/libs are completely unaware that UTF8-with-BOM is a thing at all, so they load such files with the first column header corrupted.
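
In Python, for instance, the built-in utf-8-sig codec does exactly this (the file and column names here are made up, but the codec behaviour is real):

    import csv

    # "utf-8-sig" writes the BOM bytes 0xEF,0xBB,0xBF first, so Excel
    # detects UTF8 instead of assuming CP1252; newline="" lets the csv
    # module manage line endings itself.
    with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "libellé"])
        writer.writerow([1, "café"])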

Source: we have clients pushing & pulling (or having us push/pull) data back & forth in various CSV formats, and we see some oddities in what we receive and what we are expected to send more regularly than you might think. The real fun comes when something at the client's end processes text badly (multiple steps with more than one of them incorrectly reading UTF8 as CP1252, for example) before we get hold of it, and we have to convince them that what they have sent is non-deterministically corrupt and we can't reliably fix it on the receiving end…
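
That kind of multi-step mangling is easy to reproduce (a throwaway Python illustration):

    s = "café"
    once = s.encode("utf-8").decode("cp1252")      # 'cafÃ©'
    twice = once.encode("utf-8").decode("cp1252")  # 'cafÃƒÂ©'
    # Worse, it is lossy: CP1252 leaves bytes 0x81, 0x8D, 0x8F, 0x90 and
    # 0x9D undefined, so some UTF8 sequences do not survive the round
    # trip at all -- hence "non-deterministically corrupt".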

> to get Excel to correctly load a UTF8 encoded CSV or similar you must include the BOM

Ah, so that’s the trick! I’ve run into this problem a bunch of times in the wild, where some script emits CSV that works on the developer’s machine but fails strangely with real-world data.

Good to know there’s a simple solution. I hope I remember your comment next time I see this!

  • Excel CSV is broken anyway, since in some (EU, ...) countries it needs ; as the separator.

    • That's not an Excel issue. That's a locale issue.

      Due to (parts of?) the EU using the comma as the decimal separator, you have to use another symbol to separate your values.


    • A lot of the time, when people say CSV they mean “character-separated values” rather than specifically “comma-separated values”.

      In the text files we get from clients we sometimes see tab used instead of comma, or pipe. I don't think we've seen semicolon yet, though our standard file interpreter would quietly cope¹ as long as there is nothing really odd in the header row.

      --------

      [1] it uses the heuristic “the most common non-alpha-numeric non-space non-quote character found in the header row” to detect the separator used if it isn't explicitly told what to expect
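
      In Python that heuristic might look something like this (a simplified sketch, not our actual code):

          from collections import Counter

          # Most common header character that is not alphanumeric, not a
          # plain space (tab stays eligible, being a common separator),
          # and not a quote; fall back to comma if nothing qualifies.
          def guess_separator(header_line: str) -> str:
              counts = Counter(
                  c for c in header_line
                  if not c.isalnum() and c != " " and c not in "\"'"
              )
              return counts.most_common(1)[0][0] if counts else ","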

The very fact that the UTF-8 standard itself discourages using the BOM is just so alien to me. I understand they want it to be the last encoding standing and therefore not in need of an explicit indicator, but as it currently IS NOT the only encoding in use, it makes it just so difficult to tell whether I'm reading one of the weird ASCII derivatives or actual Unicode.

It's maddening and it's frustrating. The US doesn't have any of these issues, but in Europe it's a complete mess!

  • > The very fact that UTF-8 itself discouraged from using the BOM is just so alien to me.

    One of the key advantages of UTF8 is that all ASCII content is effectively UTF-8. Having the BOM present reduces that convenience a bit, and a file starting with the three bytes 0xEF,0xBB,0xBF may be mistaken by some tools for a binary file rather than readable text.
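
    That said, tolerant readers can have it both ways. Python's utf-8-sig codec, for example, strips the BOM when present and is a no-op when it is absent:

        raw = b"\xef\xbb\xbfid,name\r\n1,caf\xc3\xa9\r\n"
        raw.decode("utf-8").split(",")[0]      # '\ufeffid' -- BOM stuck to the header
        raw.decode("utf-8-sig").split(",")[0]  # 'id' -- BOM stripped, harmless if absent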

  • > The US doesn't have any of these issues

    I think you mean “the US chooses to completely ignore these issues and gets away with it because they defined the basic standard that is used, ASCII, way-back-when, and didn't foresee it becoming an international thing so didn't think about anyone else” :)

    • > because they defined the basic standard that is used, ASCII

      I thought it was EBCDIC /s

  • From Wikipedia...

        UTF-8 always has the same byte order,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8...
        Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII. For instance many programming languages permit non-ASCII bytes in string literals but not at the start of the file. ...
        A BOM is unnecessary for detecting UTF-8 encoding. UTF-8 is a sparse encoding: a large fraction of possible byte combinations do not result in valid UTF-8 text.

    That last one is a weaker point, but it is true that with CSV a BOM is more likely to do harm than good.
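
    Weak or not, the sparseness is trivial to exploit (a sketch; real detectors weigh more evidence than one strict decode):

        def looks_like_utf8(data: bytes) -> bool:
            # Random extended-ASCII text almost never happens to form
            # valid UTF8 sequences, so a clean strict decode is a strong
            # signal that the data really is UTF8.
            try:
                data.decode("utf-8", errors="strict")
                return True
            except UnicodeDecodeError:
                return False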

  • Indeed, I've been using the BOM in all my text files for maybe decades now; those who wrote the recommendation are clearly from an English country

    • > are clearly from an English country

      One particular English-speaking country… The UK has issues with ASCII too, as our currency symbol (£) is not included. Not nearly as much trouble as non-English languages have, given the accents & such that they need, but we are still affected.