Comment by klodolph
1 day ago
This is one of those things that kind of bugs me about, say, OsStr / OsString in Rust. In theory, it’s a very nice, principled approach to strings (must be UTF-8) and filenames (arbitrary bytes, almost, on Linux & Mac). In practice, the ergonomics around OsStr are horrible. They are missing most of the API that normal strings have… it seems like manipulating them is an afterthought, and it was assumed that people would treat them as opaque (which is wrong).
Go’s more chaotic approach to allow strings to have non-Unicode contents is IMO more ergonomic. You validate that strings are UTF-8 at the place where you care that they are UTF-8. (So I’m agreeing.)
The big problem isn't invalid UTF-8 but invalid UTF-16 (on Windows et al). AIUI Go had nasty bugs around this (https://github.com/golang/go/issues/59971) until it recently adopted WTF-8, an encoding that was actually invented for Rust's OsStr.
WTF-8 has some inconvenient properties. Concatenating two strings requires special handling. Rust's opaque types can patch over this but I bet Go's WTF-8 handling exposes some unintuitive behavior.
There is a desire to add a normal string API to OsStr but the details aren't settled. For example: should it be possible to split an OsStr on an OsStr needle? This can be implemented but it'd require switching to OMG-WTF-8 (https://rust-lang.github.io/rfcs/2295-os-str-pattern.html), an encoding with even more special cases. (I've thrown my own hat into this ring with OsStr::slice_encoded_bytes().)
The current state is pretty sad yeah. If you're OK with losing portability you can use the OsStrExt extension traits.
Yeah, I avoided talking about Windows which isn’t UTF-16 but “int16 string” the same way Unix filenames are int8 strings.
IMO the differences with Windows are such that I’m much more unhappy with WTF-8. There’s a lot that sucks about C++ but at least I can do something like
Mind you this sucks for a lot of reasons, one big reason being that you’re directly exposed to the differences between path representations on different operating systems. Despite all the ways that this (above) sucks, I still generally prefer it over the approaches of Go or Rust.
> You validate that strings are UTF-8 at the place where you care that they are UTF-8.
The problem with this, as with any lack of static typing, is that you now have to validate at _every_ place that cares, or to carefully track whether a value has already been validated, instead of validating once and letting the compiler check that it happened.
In practice, the validation generally happens when you convert to JSON or use an HTML template or something like that, so it’s not so many places.
Validation is nice but Rust’s principled approach leaves me high and dry sometimes. Maybe Rust will finish figuring out the OsString interface and at that point we can say Rust has “won” the conversation, but it’s not there yet, and it’s been years.
> validation generally happens when
Except when it doesn’t and then you have to deal with fucking Cthulhu because everyone thought they could just make incorrect assumptions that aren’t actually enforced anywhere because “oh that never happens”.
That isn’t engineering. It’s programming by coincidence.
> Maybe Rust will finish figuring out the OsString interface
The entire reason OsString is painful to use is because those problems exist and are real. Golang drops them on the floor and forces you pick up the mess on the random day when an unlucky end user loses data. Rust forces you to confront them, as unfortunate as they are. It's painful once, and then the problem is solved for the indefinite future.
Rust also provides OsStrExt if you don’t care about portability, which greatly removes many of these issues.
I don’t know how that’s not ideal: mistakes are hard, but you can opt into better ergonomics if you don’t need the portability. If you end up needing portability later, the compiler will tell you that you can’t use the shortcuts you opted into.
5 replies →
It's completely in-line with Rust's approach. Concentrate on the hard stuff that lifts every boat. Like the type system, language features, and keep the standard library very small, and maybe import/adopt very successful packages. (Like once_cell. But since removing things from std is considered a forever no-no, it seems path handling has to be solved by crates. Eg. https://github.com/chipsenkbeil/typed-path )