Comment by nasretdinov

1 day ago

Note that Go strings can be invalid UTF-8; they dropped panicking on encountering an invalid UTF-8 string before 1.0, I think.

This also epitomizes the issue. What's the point of having a `string` type at all if it doesn't allow you to make any extra assumptions about the contents beyond `[]byte`? The answer is that they planned to make conversion to `string` error out on invalid UTF-8, and then assume that `string`s are valid UTF-8, but that caused problems elsewhere, so they dropped it for immediate practical convenience.
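
To make that concrete, here is a minimal Go sketch (the helper name `describe` is made up for this example): converting arbitrary bytes to `string` performs no validation, so validity has to be checked explicitly with `utf8.ValidString`.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it reports the byte length of s and
// whether s is valid UTF-8.
func describe(s string) string {
	return fmt.Sprintf("len=%d valid=%t", len(s), utf8.ValidString(s))
}

func main() {
	// 0xff never appears in well-formed UTF-8, yet the conversion succeeds.
	bad := string([]byte{'h', 'i', 0xff})
	fmt.Println(describe(bad))  // len=3 valid=false
	fmt.Println(describe("hi")) // len=2 valid=true
}
```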

  • Rust apparently got relatively close to not having &str as a primitive type and instead only providing a library alias to &[u8] when Rust 1.0 shipped.

    Score another for Rust's Safety Culture. It would be convenient to just have &str as an alias for &[u8], but if that mistake had been allowed, all the safety checking that Rust now does centrally would have to be owned by every single user forever. Instead of a few dozen checks overseen by experts there'd be myriad checks sprinkled across every project, always ready to bite you.

  • Why not use utf8.ValidString in the places it is needed? Why burden one of the most basic data types with highly specific format checks?

    It's far better to get some � when working with messy data instead of applications refusing to work and erroring out left and right.
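
    A rough sketch of what that could look like at an input boundary, assuming a made-up helper named `sanitize`: valid input passes through untouched, and `strings.ToValidUTF8` swaps each run of invalid bytes for a single U+FFFD instead of failing.

    ```go
    package main

    import (
    	"fmt"
    	"strings"
    	"unicode/utf8"
    )

    // sanitize is a hypothetical boundary helper: valid input is returned
    // unchanged, and each run of invalid bytes becomes a single U+FFFD.
    func sanitize(s string) string {
    	if utf8.ValidString(s) {
    		return s
    	}
    	return strings.ToValidUTF8(s, "\uFFFD")
    }

    func main() {
    	messy := "caf\xe9 au lait" // Latin-1 byte for é, not valid UTF-8
    	fmt.Println(sanitize(messy)) // caf� au lait
    }
    ```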

    • IMO utf8 isn't a highly specific format, it's universal for text. Every ascii string you'd write in C or C++ or whatever is already utf8.

      So that means that for 99% of scenarios, the difference between char[] and a proper utf8 string is none. They have the same data representation and memory layout.

      The problem comes in when people start using string like they use string in PHP. They just use it to store random bytes or other binary data.

      This makes no sense with the string type. String is text, but now we don't have text. That's a problem.

      We should use byte[] or something for this instead of string. That's an abuse of string. I don't think requiring strings to be text is too constraining - that's what a string is!

  • I've always thought the point of the string type was for indexing. One index of a string is always one character, but characters are sometimes composed of multiple bytes.

    • Yup. But to be clear, in Unicode a string will index code points, not characters. E.g. a single emoji can be made of multiple code points, as can certain characters in certain languages. The Unicode name for a character like this is a "grapheme cluster", and grapheme splitting is so complicated it generally belongs in a dedicated Unicode library, not a general-purpose string object.

    • You can't do that in a performant way, and going that route can lead to problems, because characters (= graphemes in the language of Unicode) often don't behave as developers assume.
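
      For what it's worth, a small Go sketch of the byte vs code point distinction (the helper name `counts` is made up here). Grapheme clusters are left out, since counting those needs a dedicated library. Note that in Go specifically, `s[i]` indexes bytes.

      ```go
      package main

      import (
      	"fmt"
      	"unicode/utf8"
      )

      // counts returns the number of bytes and the number of code points in s.
      func counts(s string) (nBytes, nRunes int) {
      	return len(s), utf8.RuneCountInString(s)
      }

      func main() {
      	// "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible
      	// character, two code points, three bytes.
      	fmt.Println(counts("e\u0301")) // 3 2

      	// Indexing yields a single byte: U+00E9 is encoded as 0xC3 0xA9.
      	fmt.Println("\u00e9"[0]) // 195
      }
      ```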

  • string is just an immutable []byte. It's actually one of my favorite things about Go that strings can contain invalid UTF-8, so you don't end up with the Rust mess of String vs OsString vs PathBuf vs Vec<u8>. It's all just string.

    • Rust &str and String are specifically intended for UTF-8 valid text. If you're working with arbitrary byte sequences, that's what &[u8] and Vec<u8> are for in Rust. It's not a "mess", it's just different from what Golang does.

  • I think maybe you've forgotten about the rune type. Rune does make assumptions.

    []rune is for sequences of Unicode code points. rune is an alias for int32. string, I think, is an alias for []byte.

    • `string` is not an alias for []byte.

      Consider:

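          // two copies of U+2588 (FULL BLOCK), 3 bytes each
          s := string([]byte{226, 150, 136, 226, 150, 136})
          for i, chr := range s {
            fmt.Printf("%d = %v\n", i, chr)
            // note: s[i] != chr; s[i] is a single byte, chr is the decoded rune
          }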
      

      How many times does that loop over 6 bytes iterate? The answer is it iterates twice, with i=0 and i=3.

      There are also quite a few standard APIs that behave weirdly if a string is not valid UTF-8, which wouldn't be the case if it were just a bag of bytes.
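
      A self-contained sketch of both points, with a made-up helper `starts` that records where each `range` iteration begins. The second half shows one case where invalid bytes don't survive: they decode as U+FFFD, so a round trip through []rune changes the string.

      ```go
      package main

      import "fmt"

      // starts returns the byte offsets at which range begins each iteration.
      func starts(s string) []int {
      	var idx []int
      	for i := range s {
      		idx = append(idx, i)
      	}
      	return idx
      }

      func main() {
      	s := string([]byte{226, 150, 136, 226, 150, 136})
      	fmt.Println(starts(s)) // [0 3]

      	// The invalid byte 0xff decodes as U+FFFD, which re-encodes as
      	// three bytes, so the round trip does not give back the original.
      	bad := "a\xffb"
      	fmt.Println(string([]rune(bad)) == bad) // false
      	fmt.Println(len(string([]rune(bad))))   // 5
      }
      ```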