Comment by simiones
14 days ago
> I'm glad they didn't go with the idiotic Go approach ("every path is a valid UTF-8 string" or "we just garble the path at the standard library level")
Can you expound a bit on this? I haven't been able to find any articles related to this kind of problem. It's also a bit surprising, given that Go specifically did not make the same choice as Rust to make strings be Unicode / UTF-8 (Go strings are just arrays of bytes, with one minor exception related to iteration using the range syntax).
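To illustrate the "arrays of bytes, except for range" point, here is a minimal sketch. The helper name `rangeRunes` and the sample string are made up for illustration; everything else is plain language behavior.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rangeRunes collects the runes that a range loop yields for s.
// (Hypothetical helper, only here to make the behavior easy to inspect.)
func rangeRunes(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	// "\xff" is a single byte that can never appear in valid UTF-8.
	s := "a\xffb"

	fmt.Println(len(s))              // length in bytes
	fmt.Println(utf8.ValidString(s)) // false: the string holds invalid UTF-8 just fine
	fmt.Printf("%x\n", s[1])         // indexing returns the raw byte, ff

	// Only range decodes UTF-8; the invalid byte is reported as U+FFFD.
	for _, r := range rangeRunes(s) {
		fmt.Printf("%U\n", r)
	}
}
```

The string itself is never altered; only the runes produced by the range loop show the replacement character.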
Go's docs put it like this: "Path names are UTF-8-encoded, unrooted, slash-separated sequences of path elements, like “x/y/z”." If you operate on a path that is not valid UTF-8, then Go will do... something to make the string work with UTF-8 when it is passed back to standard file methods, but it likely won't end up operating on the same file.
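Rust instead has OsStr (and the owned OsString) to represent platform strings such as paths; converting to a regular str is an explicit step that is either fallible (to_str) or lossy (to_string_lossy).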
Go's approach is fine for 99% of cases, but you're pretty screwed if your application hits the other 1%. Go has a lot of decisions like that, usually made to simplify the standard library for the use cases most people run into (like its awful, lossy, incomplete mapping between Unix and Windows permissions, read-only flags, etc.).
> Path names are UTF-8-encoded, unrooted, slash-separated sequences of path elements, like “x/y/z”
This is only for the "io/fs" package and its generic filesystem abstractions. The "os" package, which always operates on the real filesystem, doesn't actually specify how paths are encoded, nor does its associated helper package "path/filepath".
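A quick way to see the io/fs rule in isolation is `fs.ValidPath`, which is the check that package documents for names passed to `Open`. This is only a sketch: the wrapper `checkAll` and the sample names are invented for the example.

```go
package main

import (
	"fmt"
	"io/fs"
)

// checkAll reports, for each name, whether io/fs would accept it.
// (Hypothetical wrapper around fs.ValidPath.)
func checkAll(names []string) map[string]bool {
	out := make(map[string]bool, len(names))
	for _, n := range names {
		out[n] = fs.ValidPath(n)
	}
	return out
}

func main() {
	names := []string{
		"x/y/z",     // well-formed io/fs path
		"/x/y",      // rooted, so rejected
		"x/../y",    // contains "..", so rejected
		"x/caf\xe9", // Latin-1 byte, not valid UTF-8, so rejected
	}
	res := checkAll(names)
	for _, n := range names {
		fmt.Printf("%q -> %v\n", n, res[n])
	}
}
```

None of this applies to `os.Open` and friends, which take the string as given.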
In practice, non-UTF-8 paths were never an issue on Unix-like systems, where file paths are natively just byte sequences. You do need to be aware of this possibility to avoid mangling the paths yourself, though. The real problem was Windows, where paths are actually WTF-16, i.e. UTF-16 that may contain unpaired surrogates. Go has addressed this issue by accepting WTF-8 paths since Go 1.21: https://github.com/golang/go/issues/32334#issuecomment-15500...
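As an example of "mangling the paths yourself": round-tripping a name through `[]rune` (say, to truncate it by character) silently rewrites any invalid byte. A minimal sketch, with a made-up helper `viaRunes` and a made-up file name:

```go
package main

import "fmt"

// viaRunes round-trips a name through []rune, as code that "cleans up"
// or truncates strings by character often does. (Hypothetical helper.)
func viaRunes(name string) string {
	return string([]rune(name))
}

func main() {
	name := "report-\xe9.txt" // hypothetical Latin-1 encoded file name
	mangled := viaRunes(name)

	fmt.Printf("%q\n%q\n", name, mangled)
	// The invalid byte became U+FFFD, so this now names a different file.
	fmt.Println(name == mangled) // false
}
```

Passing `mangled` to `os.Open` would look for a different file than the one you started with, and that is your code's doing, not the standard library's.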
The `os` package, which is the main way everyone I've seen opens and reads files in Go, doesn't specify any restriction on its path syntax (except that it uses `string`, of course). I've tried using it on Linux with a file name that is invalid UTF-8, and it works without any issues.
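Roughly the experiment I mean, as a sketch. The helper `roundTrip` and the file name are made up; the result assumes a Linux filesystem that stores names as raw bytes (some other filesystems may refuse such names, in which case you get an error instead).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// roundTrip creates a file with the given name in a fresh temporary
// directory, lists the directory, and returns the name the OS reports back.
func roundTrip(name string) (string, error) {
	dir, err := os.MkdirTemp("", "nonutf8")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	if err := os.WriteFile(filepath.Join(dir, name), []byte("hi"), 0o600); err != nil {
		return "", err
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return "", err
	}
	if len(entries) != 1 {
		return "", fmt.Errorf("expected 1 entry, got %d", len(entries))
	}
	return entries[0].Name(), nil
}

func main() {
	name := "caf\xe9.txt" // Latin-1 byte, invalid as UTF-8
	got, err := roundTrip(name)
	if err != nil {
		fmt.Println("filesystem rejected the name:", err)
		return
	}
	fmt.Printf("%q round-trips unchanged: %v\n", name, got == name)
}
```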
I for one hadn't even heard of the io/fs package that has the problems you mention, and I don't remember ever seeing it used in an example. I looked through a code base I help maintain, and the only uses I could find are some function type definitions used by filepath.WalkDir and filepath.Walk. Those functions explicitly document that they don't pass `io/fs`-style paths to the callback; they don't even follow its path separator convention:
Where fs.WalkDirFunc is defined like this:
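For reference, io/fs declares the type as `type WalkDirFunc func(path string, d DirEntry, err error) error`. Below is a small sketch of how it gets used with `filepath.WalkDir`; the helpers `listFiles` and `demoTree` and the directory layout are invented for the example.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listFiles walks root and returns the path of every regular file,
// exactly as filepath.WalkDir hands it to the callback.
func listFiles(root string) ([]string, error) {
	var paths []string
	var visit fs.WalkDirFunc = func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			paths = append(paths, path)
		}
		return nil
	}
	err := filepath.WalkDir(root, visit)
	return paths, err
}

// demoTree builds a tiny temporary tree: <root>/sub/a.txt.
func demoTree() (root string, cleanup func(), err error) {
	root, err = os.MkdirTemp("", "walk")
	if err != nil {
		return "", nil, err
	}
	cleanup = func() { os.RemoveAll(root) }
	if err = os.MkdirAll(filepath.Join(root, "sub"), 0o700); err != nil {
		cleanup()
		return "", nil, err
	}
	if err = os.WriteFile(filepath.Join(root, "sub", "a.txt"), nil, 0o600); err != nil {
		cleanup()
		return "", nil, err
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := demoTree()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()

	paths, err := listFiles(root)
	fmt.Println(paths, err)
}
```

The paths handed to the callback start with `root` and use the OS separator, so on Windows they contain backslashes, which an io/fs path never would.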
> Go strings are just arrays of bytes,
https://go.dev/ref/spec#String_types: “A string value is a (possibly empty) sequence of bytes”
https://pkg.go.dev/strings@go1.26.2: “Package strings implements simple functions to manipulate UTF-8 encoded strings.”
So, yes, Go strings are just arrays of bytes in the language, but in the standard library, they’re supposed to be UTF-8 (the documentation isn’t immediately clear on how it handles non-UTF-8 strings).
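For what it's worth, here is a sketch of what a few `strings` functions do with a non-UTF-8 path. The helpers `lastElem` and `sanitize` and the sample path are made up; I'm only relying on `strings.Split`, `strings.HasSuffix` and `strings.ToValidUTF8`.

```go
package main

import (
	"fmt"
	"strings"
)

// lastElem returns the final slash-separated element of p.
// strings.Split matches bytes, so invalid UTF-8 passes through untouched.
func lastElem(p string) string {
	parts := strings.Split(p, "/")
	return parts[len(parts)-1]
}

// sanitize replaces each run of invalid UTF-8 with "?".
// This is the explicit, opt-in lossy step.
func sanitize(p string) string {
	return strings.ToValidUTF8(p, "?")
}

func main() {
	p := "dir/caf\xe9.txt" // hypothetical path containing a Latin-1 byte

	fmt.Printf("%q\n", lastElem(p))           // invalid byte preserved
	fmt.Println(strings.HasSuffix(p, ".txt")) // true, plain byte comparison
	fmt.Printf("%q\n", sanitize(p))           // "dir/caf?.txt"
}
```

So the byte-oriented helpers don't care about validity at all, and the replacement only happens when you ask for it. Functions that decode runes are a different matter and worth checking individually.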
I think this may be why the OP thinks the Go approach is “every path is a valid UTF-8 string”