Comment by jerf

1 day ago

While the general question about string encoding is fine, unfortunately in a general-purpose and cross-platform language, a file interface that enforces Unicode correctness is actively broken, in that there are files out in the world it will be unable to interact with. If your language is enforcing that, and it doesn't have a fallback to a bag of bytes, it is broken, you just haven't encountered it. Go is correct on this specific API. I'm not celebrating that fact here, nor do I expect the Go designers are either, but it's still correct.

This is one of those things that kind of bugs me about, say, OsStr / OsString in Rust. In theory, it’s a very nice, principled approach to strings (must be UTF-8) and filenames (arbitrary bytes, almost, on Linux & Mac). In practice, the ergonomics around OsStr are horrible. They are missing most of the API that normal strings have… it seems like manipulating them is an afterthought, and it was assumed that people would treat them as opaque (which is wrong).

Go’s more chaotic approach to allow strings to have non-Unicode contents is IMO more ergonomic. You validate that strings are UTF-8 at the place where you care that they are UTF-8. (So I’m agreeing.)

  • The big problem isn't invalid UTF-8 but invalid UTF-16 (on Windows et al). AIUI Go had nasty bugs around this (https://github.com/golang/go/issues/59971) until it recently adopted WTF-8, an encoding that was actually invented for Rust's OsStr.

    WTF-8 has some inconvenient properties. Concatenating two strings requires special handling. Rust's opaque types can patch over this but I bet Go's WTF-8 handling exposes some unintuitive behavior.

    There is a desire to add a normal string API to OsStr but the details aren't settled. For example: should it be possible to split an OsStr on an OsStr needle? This can be implemented but it'd require switching to OMG-WTF-8 (https://rust-lang.github.io/rfcs/2295-os-str-pattern.html), an encoding with even more special cases. (I've thrown my own hat into this ring with OsStr::slice_encoded_bytes().)

    The current state is pretty sad yeah. If you're OK with losing portability you can use the OsStrExt extension traits.

    • Yeah, I avoided talking about Windows which isn’t UTF-16 but “int16 string” the same way Unix filenames are int8 strings.

      IMO the differences with Windows are such that I’m much more unhappy with WTF-8. There’s a lot that sucks about C++ but at least I can do something like

        #if _WIN32
        using pathchar = wchar_t;
        constexpr pathchar sep = L'\\';
        #else
        using pathchar = char;
        constexpr pathchar sep = '/';
        #endif
        using pathstring = std::basic_string<pathchar>;
      

      Mind you this sucks for a lot of reasons, one big reason being that you’re directly exposed to the differences between path representations on different operating systems. Despite all the ways that this (above) sucks, I still generally prefer it over the approaches of Go or Rust.

  • > You validate that strings are UTF-8 at the place where you care that they are UTF-8.

    The problem with this, as with any lack of static typing, is that you now have to validate at _every_ place that cares, or to carefully track whether a value has already been validated, instead of validating once and letting the compiler check that it happened.

    • In practice, the validation generally happens when you convert to JSON or use an HTML template or something like that, so it’s not so many places.

      Validation is nice but Rust’s principled approach leaves me high and dry sometimes. Maybe Rust will finish figuring out the OsString interface and at that point we can say Rust has “won” the conversation, but it’s not there yet, and it’s been years.

      6 replies →

  • It's completely in-line with Rust's approach. Concentrate on the hard stuff that lifts every boat. Like the type system, language features, and keep the standard library very small, and maybe import/adopt very successful packages. (Like once_cell. But since removing things from std is considered a forever no-no, it seems path handling has to be solved by crates. Eg. https://github.com/chipsenkbeil/typed-path )