← Back to context

Comment by maxdamantus

1 day ago

It's never been clear to me where such a type is actually useful. In what cases do you really need to restrict it to valid UTF-8?

You should always be able to iterate the code points of a string, whether or not it's valid Unicode. The iterator can either silently replace any errors with replacement characters, or denote the errors by returning eg, `Result<char, Utf8Error>`, depending on the use case.

All languages that have tried restricting Unicode afaik have ended up adding workarounds for the fact that real world "text" sometimes has encoding errors and it's often better to just preserve the errors instead of corrupting the data through replacement characters, or just refusing to accept some inputs and crashing the program.

In Rust there's bstr/ByteStr (currently being added to std), awkward having to decide which string type to use.

In Python there's PEP-383/"surrogateescape", which works because Python strings are not guaranteed valid (they're potentially ill-formed UTF-32 sequences, with a range restriction). Awkward figuring out when to actually use it.

In Raku there's UTF8-C8, which is probably the weirdest workaround of all (left as an exercise for the reader to try to understand .. oh, and it also interferes with valid Unicode that's not normalized, because that's another stupid restriction).

Meanwhile the Unicode standard itself specifies Unicode strings as being sequences of code units [0][1], so Go is one of the few modern languages that actually implements Unicode (8-bit) strings. Note that at least two out of the three inventors of Go also basically invented UTF-8.

[0] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

> Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.

[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

> Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form.

The way Rust handles this is perfectly fine. String type promises its contents are valid UTF-8. When you create it from array of bytes, you have three options: 1) ::from_utf8, which will force you to handle invalid UTF-8 error, 2) ::from_utf8_lossy, which will replace invalid code points with replacement character code point, and 3) from_utf8_unchecked, which will not do the validity check and is explicitly marked as unsafe.

  • But there's no option to just construct the string with the invalid bytes. 3) is not for this purpose; it is for when you already know that it is valid.

    If you use 3) to create a &str/String from invalid bytes, you can't safely use that string as the standard library is unfortunately designed around the assumption that only valid UTF-8 is stored.

    https://doc.rust-lang.org/std/primitive.str.html#invariant

    > Constructing a non-UTF-8 string slice is not immediate undefined behavior, but any function called on a string slice may assume that it is valid UTF-8, which means that a non-UTF-8 string slice can lead to undefined behavior down the road.

    • How could any library function work with completely random bytes? Like, how would it iterate over code points? It may want to assume utf8's standard rules and e.g. know that after this byte prefix, the next byte is also part of the same code point (excuse me if I'm using wrong terminology), but now you need complex error handling at every single line, which would be unnecessary if you just made your type represent only valid instances.

      Again, this is the same simplistic, vs just the right abstraction, this just smudges the complexity over a much larger surface area.

      If you have a byte array that is not utf-8 encoded, then just... use a byte array.

      16 replies →

    • > If you use 3) to create a &str/String from invalid bytes, you can't safely use that string as the standard library is unfortunately designed around the assumption that only valid UTF-8 is stored.

      Yes, and that's a good thing. It allows every code that gets &str/String to assume that the input is valid UTF-8. The alternative would be that every single time you write a function that takes a string as an argument, you have to analyze your code, consider what would happen if the argument was not valid UTF-8, and handle that appropriately. You'd also have to redo the whole analysis every time you modify the function. That's a horrible waste of time: it's much better to:

      1) Convert things to String early, and assume validity later, and

      2) Make functions that explicitly don't care about validity take &[u8] instead.

      This is, of course, exactly what Rust does: I am not aware of a single thing that &str allows you to do that you cannot do with &[u8], except things that do require you to assume it's valid UTF-8.

      3 replies →

> It's never been clear to me where such a type is actually useful. In what cases do you really need to restrict it to valid UTF-8?

Because 99.999% of the time you want it to be valid and would like an error if it isn't? If you want to work with invalid UTF-8, that should be a deliberate choice.

  • Do you want grep to crash when your text file turned out to have a partially written character in it? 99.999% seems very high, and you haven't given an actual use case for the restriction.

    • Crash? No. But I can safely handle the error where it happens, because the language actually helps me with this situation by returning a proper Result type. So I have to explicitly check which "variant" I have, instead of forgetting to call the validate function in case of go.

    • Rust doesn't crash when it gets an error unless you tell it to. You make a choice how to handle the error because you have to it or it won't compile. If you don't care about losing information when reading a file, you can use the lossy function that gracefully handles invalid bytes.