Comment by 0x000xca0xfe
1 day ago
Why not use utf8.ValidString in the places it is needed? Why burden one of the most basic data types with highly specific format checks?
It's far better to get some � when working with messy data instead of applications refusing to work and erroring out left and right.
IMO UTF-8 isn't a highly specific format, it's universal for text. Every ASCII string you'd write in C or C++ or whatever is already UTF-8.
So for 99% of scenarios, there is no difference between char[] and a proper UTF-8 string. They have the same data representation and memory layout.
The problem comes in when people start using string like they use string in PHP. They just use it to store random bytes or other binary data.
This makes no sense with the string type. String is text, but now we don't have text. That's a problem.
We should use byte[] or something for this instead of string. That's an abuse of string. I don't think requiring strings to be text is too constraining - that's what a string is!
The approach you are advocating is the approach that was abandoned, for good reasons, in the Unix filesystem in the 70s and in Perl in the 80s.
One of the great advances of Unix was that you don't need separate handling for binary data and text data; they are stored in the same kind of file and can be contained in the same kinds of strings (except, sadly, in C). Occasionally you need to do some kind of text-specific processing where you care, but the rest of the time you can keep all your code 8-bit clean so that it can handle any data safely.
Languages that have adopted the approach you advocate, such as Python, frequently have bugs like exception tracebacks they can't print (because stdout is set to ASCII) or filenames they can't open when they're passed in on the command line (because they aren't valid UTF-8).
As I demonstrated in https://news.ycombinator.com/item?id=44991638, it's easy to run into this problem in, for example, Rust.
Not all text is UTF-8, and there are real world contexts (e.g. Windows) where this matters a lot.
Yes, Windows text is broken in its own special way.
We can try to shove it into objects that work on other text but this won't work in edge cases.
Like if I take text on Linux and try to write a Windows file with that text, it's broken. And vice versa.
Go lets you do the broken thing. In Rust, or even using libraries in most languages, you can't. You have to specifically convert between them.
That's what I mean when I say "storing random binary data as text". Sure, Windows' almost-UTF-16 abomination is kind of text, but not really. It's its own thing, which requires a different type of string OR converting it to a normal string.
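As a rough illustration of why the conversion has to be explicit: Windows-style UTF-16 can contain unpaired surrogates, which have no UTF-8 form. The Go sketch below uses the standard `unicode/utf16` package, whose `Decode` substitutes U+FFFD for a lone surrogate. The helper name `decodeUTF16` and its round-trip check are assumptions made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decodeUTF16 (hypothetical helper) converts UTF-16 code units to a Go
// string. The second result reports whether the conversion was lossless,
// checked by re-encoding and comparing with the input.
func decodeUTF16(units []uint16) (string, bool) {
	s := string(utf16.Decode(units))
	roundTrip := utf16.Encode([]rune(s))
	if len(roundTrip) != len(units) {
		return s, false
	}
	for i := range units {
		if roundTrip[i] != units[i] {
			return s, false
		}
	}
	return s, true
}

func main() {
	// A well-formed surrogate pair (U+1F600) converts without loss.
	s, ok := decodeUTF16([]uint16{'h', 'i', 0xD83D, 0xDE00})
	fmt.Printf("%q lossless=%v\n", s, ok)

	// A lone high surrogate is not text; it becomes U+FFFD.
	s, ok = decodeUTF16([]uint16{'a', 0xD800, 'b'})
	fmt.Printf("%q lossless=%v\n", s, ok)
}
```

Whether to reject the lossy case or accept the replacement character is exactly the policy choice being argued about in this thread.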