Comment by assbuttbuttass
1 day ago
string is just an immutable []byte. It's actually one of my favorite things about Go that strings can contain invalid UTF-8, so you don't end up with the Rust mess of String vs OsString vs PathBuf vs Vec<u8>. It's all just string.
Rust &str and String are specifically intended for UTF-8 valid text. If you're working with arbitrary byte sequences, that's what &[u8] and Vec<u8> are for in Rust. It's not a "mess", it's just different from what Golang does.
If anything, that makes Rust programs more likely to be correct under any strange text input, while Go might only handle the happy path of ASCII inputs.
Stuff like this matters a great deal on the standard library level.
It's never been clear to me where such a type is actually useful. In what cases do you really need to restrict it to valid UTF-8?
You should always be able to iterate the code points of a string, whether or not it's valid Unicode. The iterator can either silently replace any errors with replacement characters, or report the errors by returning e.g. `Result<char, Utf8Error>`, depending on the use case.
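To make the two iteration styles concrete, here is a minimal Rust sketch using only stable std APIs (`std::str::from_utf8` and `Utf8Error`). The function names `decode` and `decode_lossy` are made up for illustration, and it returns a `Vec` rather than a lazy iterator purely to keep it short. It is one possible shape for the idea, not how any particular library does it.

```rust
use std::str::Utf8Error;

/// Hypothetical helper: decodes `bytes` into code points, yielding an `Err`
/// for each invalid sequence instead of stopping at the first one.
fn decode(mut bytes: &[u8]) -> Vec<Result<char, Utf8Error>> {
    let mut out = Vec::new();
    while !bytes.is_empty() {
        match std::str::from_utf8(bytes) {
            Ok(valid) => {
                // The remainder is all valid; emit it and stop.
                out.extend(valid.chars().map(Ok));
                break;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // The prefix was just validated, so this cannot fail.
                out.extend(std::str::from_utf8(valid).unwrap().chars().map(Ok));
                // `error_len()` is None when the input ends mid-sequence;
                // in that case the whole remainder is one error.
                let skip = err.error_len().unwrap_or(rest.len());
                out.push(Err(err));
                bytes = &rest[skip..];
            }
        }
    }
    out
}

/// The "silently replace" variant: every error becomes U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    decode(bytes)
        .into_iter()
        .map(|r| r.unwrap_or('\u{FFFD}'))
        .collect()
}
```

With this sketch, `decode(b"a\xFFb")` yields three items (`Ok('a')`, an `Err`, `Ok('b')`), and the caller decides whether to keep, replace, or reject the bad byte.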
All languages that have tried restricting Unicode afaik have ended up adding workarounds for the fact that real world "text" sometimes has encoding errors and it's often better to just preserve the errors instead of corrupting the data through replacement characters, or just refusing to accept some inputs and crashing the program.
In Rust there's bstr/ByteStr (currently being added to std), which leaves you with the awkward job of deciding which string type to use.
In Python there's PEP-383/"surrogateescape", which works because Python strings are not guaranteed valid (they're potentially ill-formed UTF-32 sequences, with a range restriction). Awkward figuring out when to actually use it.
In Raku there's UTF8-C8, which is probably the weirdest workaround of all (left as an exercise for the reader to try to understand .. oh, and it also interferes with valid Unicode that's not normalized, because that's another stupid restriction).
Meanwhile the Unicode standard itself specifies Unicode strings as being sequences of code units [0][1], so Go is one of the few modern languages that actually implements Unicode (8-bit) strings. Note that at least two out of the three inventors of Go also basically invented UTF-8.
[0] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...
> Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.
[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...
> Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form.
The way Rust handles this is perfectly fine. The String type promises that its contents are valid UTF-8. When you create one from an array of bytes, you have three options: 1) String::from_utf8, which forces you to handle the invalid UTF-8 error, 2) String::from_utf8_lossy, which replaces each invalid byte sequence with the replacement character (U+FFFD), and 3) String::from_utf8_unchecked, which skips the validity check and is explicitly marked as unsafe.
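For reference, the three options look roughly like this. The wrapper names `strict`, `lossy`, and `unchecked` are hypothetical and exist only to label each case; the std calls inside them are the real APIs.

```rust
use std::borrow::Cow;
use std::string::FromUtf8Error;

/// 1) Strict: the caller must handle the error if `bytes` is not valid UTF-8.
fn strict(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

/// 2) Lossy: invalid byte sequences are replaced with U+FFFD.
/// Returns a borrowed `Cow` when the input was already valid.
fn lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// 3) Unchecked: no validation at all.
///
/// # Safety
/// The caller must guarantee that `bytes` is valid UTF-8.
unsafe fn unchecked(bytes: Vec<u8>) -> String {
    // SAFETY: validity is the caller's responsibility, per the contract above.
    unsafe { String::from_utf8_unchecked(bytes) }
}
```

Here `strict(b"hi \xFF".to_vec())` returns an `Err`, while `lossy(b"hi \xFF")` returns the text with U+FFFD in place of the bad byte. The error from `strict` also hands the original bytes back via `into_bytes()`, so nothing is lost if you decide to keep them as a `Vec<u8>`.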
> It's never been clear to me where such a type is actually useful. In what cases do you really need to restrict it to valid UTF-8?
Because 99.999% of the time you want it to be valid and would like an error if it isn't? If you want to work with invalid UTF-8, that should be a deliberate choice.