Comment by maxdamantus

6 months ago

Ah, mentioning Windows filenames would have been useful.

I guess the issue isn't so much about whether strings are well-formed, but about whether the conversion (eg, from UTF-16 to UTF-8 at the filesystem boundary) raises an error or silently modifies the data to use replacement characters.

I do think that is the main fundamental mistake in Go's Unicode handling; it tends to use replacement characters automatically instead of signalling errors. Using replacement characters is at least conformant to Unicode but imo unless you know the text is not going to be used as an identifier (like a filename), conversion should instead just fail.

The other option is using some mechanism to preserve the errors instead of failing quietly (replacement) or failing loudly (raise/throw/panic/return err), and I believe that's what they're now doing for filenames on Windows, using WTF-8. I agree with this new approach, though would still have preferred they not use replacement characters automatically in various places (another one is the "json" module, which quietly corrupts your non-UTF-8 and non-UTF-16 data using replacement characters).

Probably worth noting that the WTF-8 approach works because strings are not validated; WTF-8 involves converting invalid UTF-16 data into invalid UTF-8 data such that the conversion is reversible. It would not be possible to encode invalid UTF-16 data into valid UTF-8 data without changing the meaning of valid Unicode strings.

0 comments

maxdamantus

No comments yet

Contribute on Hacker News ↗