Comment by kragen

1 day ago

There are a lot of operations that are valid and well-defined on binary strings, such as sorting them, hashing them, writing them to files, measuring their lengths, indexing a trie with them, splitting them on delimiter bytes or substrings, concatenating them, substring-searching them, posting them to ZMQ as messages, subscribing to them as ZMQ prefixes, using them as keys or values in LevelDB, and so on. For binary strings that don't contain null bytes, we can add passing them as command-line arguments and using them as filenames.

The entire point of UTF-8 (designed, by the way, by Ken Thompson and Rob Pike, who later co-designed Go) is to encode Unicode in such a way that these byte-string operations perform the corresponding Unicode operations. That is precisely so that you don't have to care whether your string is Unicode or just plain ASCII, and so that you don't need any error handling except in the rare case where you want to do something related to the text that the string semantically represents. The only operation that doesn't really map is measuring the length.
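To make that concrete, here is a minimal Rust sketch (the helper names `cmp_bytes`, `cmp_code_points`, and `find_bytes` are made up for illustration, not from any library). It checks on a handful of sample strings that comparing raw UTF-8 bytes gives the same order as comparing code points, that a byte-level substring search finds the same offset as a text-level one, and that length is where the two views disagree.

```rust
use std::cmp::Ordering;

/// Compare two strings by their raw UTF-8 bytes only.
fn cmp_bytes(a: &str, b: &str) -> Ordering {
    a.as_bytes().cmp(b.as_bytes())
}

/// Compare two strings code point by code point.
fn cmp_code_points(a: &str, b: &str) -> Ordering {
    a.chars().cmp(b.chars())
}

/// Naive substring search that looks only at bytes (hypothetical helper).
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    // A mix of 1-, 2-, 3-, and 4-byte UTF-8 sequences.
    let words = ["zebra", "\u{e9}mile", "\u{65e5}\u{672c}", "apple", "\u{1d11e} clef"];
    for &a in &words {
        for &b in &words {
            // Byte order and code point order agree for every pair.
            assert_eq!(cmp_bytes(a, b), cmp_code_points(a, b));
        }
    }

    // Byte-level search lands on the same offset as str::find.
    let text = "na\u{ef}ve caf\u{e9}";
    let needle = "caf\u{e9}";
    assert_eq!(find_bytes(text.as_bytes(), needle.as_bytes()), text.find(needle));

    // Length is the exception: bytes and code points are counted differently.
    assert_eq!("\u{e9}".len(), 2);
    assert_eq!("\u{e9}".chars().count(), 1);
}
```

This only spot-checks a few strings; the ordering property itself follows from how UTF-8 lays out lead and continuation bytes.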

15 comments


xyzzyz 1 day ago

> There are a lot of operations that are valid and well-defined on binary strings, such as (...), and so on.

Every single thing you listed here is supported by the &[u8] type. That's the point: if you want to operate on data without assuming it's valid UTF-8, you just use &[u8] (or the allocating Vec<u8>), and the standard library offers what you'd typically want, except for the functions that assume the string is valid UTF-8 (e.g. iterating over code points). If you want those, you need to convert your &[u8] to &str, and the conversion forces you to check for errors.
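A rough sketch of what that looks like in practice, using only standard-library calls (`split_fields`, `count_keys`, and `as_text` are hypothetical helper names for this example). Splitting, sorting, and hashing work on bytes that are not valid UTF-8, and only the conversion to &str returns a Result that has to be handled.

```rust
use std::collections::HashMap;
use std::str;

/// Split a byte string on a delimiter byte; nothing here assumes UTF-8.
fn split_fields(line: &[u8], delim: u8) -> Vec<&[u8]> {
    line.split(|&b| b == delim).collect()
}

/// Count occurrences of each field, using raw bytes as hash-map keys.
fn count_keys<'a>(keys: &[&'a [u8]]) -> HashMap<&'a [u8], usize> {
    let mut counts = HashMap::new();
    for &k in keys {
        *counts.entry(k).or_insert(0) += 1;
    }
    counts
}

/// The one step that can fail: viewing the bytes as text.
fn as_text(bytes: &[u8]) -> Result<&str, str::Utf8Error> {
    str::from_utf8(bytes)
}

fn main() {
    // The byte 0xFF never occurs in valid UTF-8.
    let line: &[u8] = b"alpha:\xffbeta:alpha";

    let mut fields = split_fields(line, b':');
    fields.sort(); // plain lexicographic byte order
    assert_eq!(fields[0], &b"alpha"[..]);
    assert_eq!(fields[2], &b"\xffbeta"[..]);

    let counts = count_keys(&fields);
    assert_eq!(counts[&b"alpha"[..]], 2);

    // Conversion succeeds for valid UTF-8 and reports an error otherwise.
    assert_eq!(as_text(fields[0]), Ok("alpha"));
    assert!(as_text(fields[2]).is_err());
}
```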

maxdamantus 1 day ago
The problem is that there are so many functions that unnecessarily take `&str` rather than `&[u8]` because the expectation is that textual things should use `&str`.
So you naturally write another one of these functions that takes a `&str` so that it can pass to another function that only accepts `&str`.
Fundamentally, no one actually requires validation (i.e., walking over the string an extra time up front). We're just making it part of the contract because something else has made it part of the contract.
- kragen 1 day ago
  
  It's much worse than that: in many cases, such as passing a filename to a program on the Linux command line, correct behavior requires not validating, so erroring out when validation fails introduces bugs. I've explained this in more detail in https://news.ycombinator.com/item?id=44991638.
kragen 1 day ago
That's semantically okay, but giving &str such a short name creates a dangerous temptation to use it for things such as filenames, stdio, and command-line arguments, where that process of conversion introduces errors into code that would otherwise work reliably for any non-null-containing string, as it does in Go. If it were called something like ValidatedUnicodeTextSlice it would probably be fine.
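One concrete case, sketched in Rust under the assumption of a Unix target: `std::env::args()` is documented to panic while iterating if any argument is not valid Unicode, whereas `std::env::args_os()` yields `OsString` values and accepts any non-null bytes. The helpers `paths_from_args`, `as_utf8`, and `non_utf8_example` below are hypothetical names for this example only.

```rust
use std::ffi::{OsStr, OsString};
use std::os::unix::ffi::OsStringExt; // Unix-only: build an OsString from raw bytes
use std::path::PathBuf;

/// Treat every argument after the program name as a path,
/// without requiring it to be valid Unicode.
fn paths_from_args<I>(args: I) -> Vec<PathBuf>
where
    I: IntoIterator<Item = OsString>,
{
    args.into_iter().skip(1).map(PathBuf::from).collect()
}

/// The checked route to &str: None when the argument is not valid UTF-8.
fn as_utf8(arg: &OsStr) -> Option<&str> {
    arg.to_str()
}

/// A filename that is legal on Linux but is not valid UTF-8.
fn non_utf8_example() -> OsString {
    OsString::from_vec(vec![b'f', 0xff, b'o'])
}

fn main() {
    // Simulated argv; a real program would pass std::env::args_os() instead.
    let argv = vec![OsString::from("prog"), non_utf8_example()];
    let paths = paths_from_args(argv);
    assert_eq!(paths.len(), 1);

    // The path survives intact, even though it cannot be viewed as &str.
    assert_eq!(paths[0].as_os_str(), non_utf8_example().as_os_str());
    assert!(as_utf8(paths[0].as_os_str()).is_none());

    for path in paths_from_args(std::env::args_os()) {
        // display() is lossy and is meant for printing only.
        println!("{}", path.display());
    }
}
```

The short `&str`-based route (`std::env::args()` plus `String` paths) would reject the same filename that this version passes through unchanged.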
- adastra22 1 day ago
  
  I'd agree if it was &[bytes] or whatever. But &[u8] isn't that much different from &str.
  
- xyzzyz 1 day ago
  
  It's actually extremely hard to introduce problems like that, precisely because Rust's standard library is very well designed. Can you give an example scenario where it would be a problem?
  

gf000 1 day ago

Then [u8] can surely implement those functions.