Comment by inferiorhuman

1 day ago

Even so, you end up with paper cuts like `len`, which returns the number of bytes.

The problem with string length is that there are probably at least four concepts that could conceivably be called length, and few people are happy when none of them is `len`.

Off the top of my head, in order of likely difficulty to calculate: byte length, number of code points, number of graphemes/characters, and height/width to display.

Maybe it would be best for `Str` not to have `len` at all. It could have `bytes`, `code_points`, and `graphemes`, and every use would be precise.
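
Roughly what that looks like in today's Rust, as a sketch (grapheme counting isn't in std; this assumes the third-party unicode-segmentation crate):

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "cafe\u{301}"; // "cafe" plus a combining acute accent
        assert_eq!(s.len(), 6);                   // bytes: the accent is 2 bytes in UTF-8
        assert_eq!(s.chars().count(), 5);         // scalar values / code points
        assert_eq!(s.graphemes(true).count(), 4); // grapheme clusters: c a f é
    }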

  • > The problem with string length is there's probably at least four concepts that could conceivably be called length.

    The answer here isn't to throw up your hands, pick one, and other cases be damned. It's to expose them all and let the engineer choose. To not beat the dead horse of Rust, I'll point out that Ruby gets this right too.

        * String#length                   # count characters (code points)
        * String#bytes#length             # count bytes
        * String#grapheme_clusters#length # count grapheme clusters
    

    Similarly, each of those "views" lets you slice, index, etc. across those concepts naturally. Golang's strings are the worst of them all. They're nominally UTF-8, but nothing actually enforces that; really they're just buckets of bytes, until you send them to APIs that silently require UTF-8 and drop them on the floor or misbehave when they're not.
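
    Rust exposes similar views, if less uniformly; a sketch (the grapheme view again assumes the third-party unicode-segmentation crate):

        fn main() {
            let s = "héllo";
            let byte = s.as_bytes()[1];         // byte view: 0xC3, first byte of 'é'
            let ch = s.chars().nth(1).unwrap(); // scalar view: 'é'
            assert_eq!((byte, ch), (0xC3, 'é'));
            // grapheme view: s.graphemes(true).nth(1), via unicode-segmentation
        }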

    Height/width to display is font-dependent, so it can't live on a "string" alone; it needs an object with additional context.

  • Problems arise when you try to take a slice of a string and end up picking an index (perhaps based on length) that would split a code point. `String`/`str` offers an abstraction over Unicode scalar values (code points) via the `chars` iterator, but it all feels a bit messy to have the byte-based abstraction more or less be the default.

    FWIW the docs indicate that working with grapheme clusters will never end up in the standard library.
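
    A minimal sketch of that failure mode, and the boundary check the standard library does give you (plain std, nothing assumed):

        fn main() {
            let s = "naïve";
            // 'ï' occupies bytes 2..4, so slicing at byte index 3 would split it:
            // let bad = &s[..3]; // panics: byte index 3 is not a char boundary

            // Back up to the previous boundary before slicing:
            let mut end = 3;
            while !s.is_char_boundary(end) {
                end -= 1;
            }
            assert_eq!(&s[..end], "na");
        }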

    • You can easily treat `&str` as bytes: just call `.as_bytes()` and you get `&[u8]`, no questions asked. The reason you don't want to treat `&str` as just bytes by default is that it's almost always the wrong thing to do. Moreover, it's the worst kind of wrong, because it actually works correctly 99% of the time, so you might not even realize you have a bug until much too late.

      If your API takes `&str` and tries to do byte-based indexing, it should almost certainly be taking `&[u8]` instead.
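
      For instance, with a hypothetical, purely illustrative `checksum` helper, taking `&[u8]` makes the byte view an explicit opt-in at the call site:

          // Hypothetical byte-oriented helper: it takes &[u8], not &str...
          fn checksum(data: &[u8]) -> u32 {
              data.iter().fold(0u32, |acc, &b| acc.wrapping_add(u32::from(b)))
          }

          fn main() {
              // ...so callers reach for the byte view explicitly:
              let h = "héllo";
              assert_eq!(checksum(h.as_bytes()),
                         checksum(&[0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]));
          }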

    • > but it all feels a bit messy to have the byte based abstraction more or less be the default.

      I mean, really neither should be the default; you should have to pick chars or bytes at each use. But I don't think that's palatable, and most languages have chosen one or the other as the preferred form. Some have the joy of having been forward-thinking in the 90s: built around UCS-2 and later extended to UTF-16, so you've got 16-bit 'characters' and some code points that take two of them. Of course, dealing with operating systems means dealing with whatever they have as well as what the language prefers (or, as discussed elsewhere in this thread, pretending it doesn't exist to make easy things easier and hard things harder).
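
      To make the UTF-16 wrinkle concrete, a sketch in Rust (std only; the same arithmetic bites Java, JavaScript, and C#):

          fn main() {
              // One code point outside the BMP is two UTF-16 code units (a surrogate pair):
              let crab = "🦀";
              assert_eq!(crab.chars().count(), 1);        // one scalar value
              assert_eq!(crab.encode_utf16().count(), 2); // two 16-bit "characters"
              assert_eq!(crab.len(), 4);                  // and four UTF-8 bytes
          }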