Comment by const_cast
1 day ago
IMO UTF-8 isn't a highly specific format; it's universal for text. Every ASCII string you'd write in C or C++ or whatever is already valid UTF-8.
So for 99% of scenarios there is no difference between a char[] and a proper UTF-8 string: they have the same data representation and memory layout.
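To make that concrete, here's a minimal Rust sketch (the literal bytes are just an example): validating ASCII bytes as UTF-8 is a zero-copy check, because the representation is identical.

    fn main() {
        // Every ASCII byte sequence is already valid UTF-8.
        let ascii: &[u8] = b"hello, world";
        // Zero-copy validation: same bytes, same memory layout.
        let s = std::str::from_utf8(ascii).unwrap();
        assert_eq!(s.as_bytes(), ascii);
        println!("{s}");
    }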
The problem comes in when people start using string like they use string in PHP. They just use it to store random bytes or other binary data.
This makes no sense with the string type. String is text, but now we don't have text. That's a problem.
We should use byte[] or something for this instead of string. That's an abuse of the string type. I don't think requiring strings to actually be text is too constraining - that's what a string is!
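In Rust terms, roughly (a sketch, not a prescription): the string type enforces the text invariant at the boundary, and a plain byte buffer carries everything else.

    fn main() {
        // Arbitrary binary data: 0xFF can never start a UTF-8 sequence.
        let binary: Vec<u8> = vec![0x00, 0xFF, 0xFE, 0x80];

        // Vec<u8> holds any bytes; String refuses data that isn't text.
        assert!(String::from_utf8(binary.clone()).is_err());

        // So binary payloads belong in a byte buffer, not a string.
        let _payload: Vec<u8> = binary;
    }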
The approach you are advocating is the approach that was abandoned, for good reasons, in the Unix filesystem in the 70s and in Perl in the 80s.
One of the great advances of Unix was that you don't need separate handling for binary data and text data; they are stored in the same kind of file and can be contained in the same kinds of strings (except, sadly, in C). Occasionally you need to do some kind of text-specific processing where you care, but the rest of the time you can keep all your code 8-bit clean so that it can handle any data safely.
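For instance, in Rust (a sketch; "data.bin" is a made-up path), you can stay 8-bit clean by reading raw bytes and decoding only at the few points that genuinely need text:

    use std::fs;

    fn main() -> std::io::Result<()> {
        // 8-bit clean: fs::read hands back raw bytes, text or not.
        let bytes: Vec<u8> = fs::read("data.bin")?;

        // Decode only where text-specific processing is required.
        match std::str::from_utf8(&bytes) {
            Ok(text) => println!("UTF-8 text, {} chars", text.chars().count()),
            Err(_) => println!("binary data, {} bytes", bytes.len()),
        }
        Ok(())
    }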
Languages that have adopted the approach you advocate, such as Python, frequently have bugs like exception tracebacks they can't print (because stdout is set to ASCII) or filenames they can't open when they're passed in on the command line (because they aren't valid UTF-8).
As I demonstrated in https://news.ycombinator.com/item?id=44991638, it's easy to run into this problem in, for example, Rust.
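One minimal illustration of the command-line case (a sketch, separate from the linked example): std::env::args() panics on a non-UTF-8 argument, while args_os() accepts whatever the OS passed in.

    use std::env;

    fn main() {
        // env::args() would panic on an argument that isn't valid Unicode;
        // env::args_os() yields OsString and never rejects input.
        for arg in env::args_os().skip(1) {
            match arg.to_str() {
                Some(s) => println!("valid UTF-8: {s}"),
                None => println!("non-UTF-8 argument: {arg:?}"),
            }
        }
    }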
Not all text is UTF-8, and there are real-world contexts (e.g. Windows) where this matters a lot.
Yes, Windows text is broken in its own special way.
We can try to shove it into objects that work on other text, but this won't work in edge cases.
Like if I take text on Linux and try to write a Windows file with that text, it's broken. And vice versa.
Go lets you do the broken thing. In Rust, or even using libraries in most languages, you can't: you have to explicitly convert between them.
That's what I mean when I say "storing random binary data as text". Sure, Windows' almost-UTF-16 abomination is kind of text, but not really. It's its own thing. That requires a different type of string OR converting it to a normal string.
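Rust's OsString is one way to model that "different type of string" (a sketch; an environment variable is used just as a convenient source of platform text): the conversion to a normal string is explicit, and it can fail or be deliberately lossy.

    use std::ffi::OsString;

    fn main() {
        // An OsString can hold platform data (e.g. unpaired UTF-16
        // surrogates on Windows) that has no valid UTF-8 form.
        let name: OsString = std::env::var_os("PATH").unwrap_or_default();

        // to_str() fails rather than pretending; to_string_lossy()
        // converts explicitly, substituting U+FFFD where needed.
        match name.to_str() {
            Some(s) => println!("already valid UTF-8 ({} bytes)", s.len()),
            None => println!("lossy: {}", name.to_string_lossy()),
        }
    }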
Even on Linux, you can't have '/' in a filename, or ':' on macOS. And this is without getting into issues related to the null byte in strings. Having a separate Path object that represents a filename or path + filename makes sense, because on every platform there are idiosyncratic requirements surrounding paths.
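That's essentially the design of std::path in Rust (a sketch; the paths are made up): Path wraps OsStr, so it can carry any filename the OS allows and applies the platform's separator rules for you.

    use std::ffi::OsStr;
    use std::path::PathBuf;

    fn main() {
        // push() applies platform separator rules; components stay OsStr,
        // so filenames that aren't valid UTF-8 survive untouched.
        let mut p = PathBuf::from("/var/backups");
        p.push("2024-01-01.tar");

        assert_eq!(p.file_name(), Some(OsStr::new("2024-01-01.tar")));
        // display() is the explicit, possibly-lossy rendering for humans.
        println!("{}", p.display());
    }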
It may be legacy cruft downstream of poorly-thought-out design decisions at the system/OS level, but we're stuck with it. And a language that doesn't provide the tooling necessary to muddle through this mess safely isn't a serious platform to build on, IMHO.
There is room for languages that explicitly make the tradeoff of being easy to use (e.g. a unified string type) at the cost of not handling many real-world edge cases correctly. But those languages should not be used for serious things like backup systems, where edge cases result in lost data. Go makes this tradeoff for the sake of language simplicity while being marketed and positioned as a serious language for writing serious programs, which it is not.