Comment by dwheeler

4 years ago

On POSIX systems, file names are not strings; they are sequences of bytes. They might not be UTF-8 or have any meaning at all. Python 3 had to hack around this: its developers thought they could force everything to Unicode and discovered that doesn't work.
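For anyone curious what that hack looks like: Python 3 uses the `surrogateescape` error handler (PEP 383), which `os.fsdecode` and `os.fsencode` apply on POSIX. A minimal sketch, assuming a UTF-8 filesystem encoding; the helper names `to_str` and `to_bytes` are made up for illustration, and the codec is spelled out so the result does not depend on the machine's locale:

```python
def to_str(raw: bytes) -> str:
    # Undecodable bytes are smuggled into the str as lone surrogates
    # (byte 0xE9 becomes U+DCE9) instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(name: str) -> bytes:
    # The reverse mapping restores the original bytes exactly.
    return name.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"  # Latin-1 encoded name, not valid UTF-8
name = to_str(raw)
assert name == "caf\udce9.txt"
assert to_bytes(name) == raw  # lossless round trip
```

The resulting str round-trips to the same bytes, but it is not valid Unicode text, so printing it or encoding it strictly can still fail.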

Which makes for fun issues, like the fact that there's no standard way to display a filename in Unix. A system that's, you know, all about files.

  • At least on most Linux systems (not sure about other *nix, but I expect the same?), there is a system default encoding, defined by the locale. I think decoding the filename in that encoding and displaying the resulting string is probably the correct way to display a filename. That seems about as good as you are likely to get on any system, really.

    I think for any POSIX system, either there is locale support defining the encoding, or it uses the POSIX locale, which defines the encoding (ASCII).

    Of course, you need to handle cases where filenames cannot be decoded in the system encoding (probably by replacing the bytes that cannot be decoded), because a filename in a different encoding, or with no valid encoding at all, may have been used on disk. Systems can declare that file names containing bytes that are not valid in the system's encoding are not valid file names, but that doesn't stop people mounting disks that contain them, so the problem never goes away if you support opening media from other systems.

    What I am saying is that this is no more a Unix problem than it is a problem on any system that supports removable media.

  • That's probably because paths aren't properties of the file itself, they're helpers to reference the file.
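The first reply's suggestion (decode in the locale's encoding, replace whatever cannot be decoded) can be sketched as follows. This is only an illustration: `display_name` is a hypothetical helper, and defaulting to `sys.getfilesystemencoding()` is an assumption about where the "system encoding" should come from.

```python
import sys


def display_name(raw: bytes, encoding: str = "") -> str:
    # Assumption: fall back to Python's idea of the filesystem encoding,
    # which on POSIX is derived from the locale.
    enc = encoding or sys.getfilesystemencoding()
    # errors="replace" turns each undecodable byte into U+FFFD, so the
    # result is always printable but cannot be mapped back to the bytes.
    return raw.decode(enc, errors="replace")


assert display_name(b"caf\xe9.txt", "utf-8") == "caf\ufffd.txt"
assert display_name(b"caf\xc3\xa9.txt", "utf-8") == "café.txt"
```

The string is for display only; the original bytes still have to be kept around to actually open the file.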

On POSIX systems, file paths are C strings: sequences of bytes that cannot include the 0 (NUL) byte. Being UTF-8, or having any other meaning, is not required for something to be a string.
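A small sketch of that rule, with a hypothetical helper `is_c_string_path`: any non-empty byte sequence without a NUL byte can be passed as a path, whether or not it decodes as UTF-8. Python enforces the same restriction before calling into the OS, raising `ValueError` for an embedded NUL.

```python
import os


def is_c_string_path(raw: bytes) -> bool:
    # A path handed to the kernel is NUL-terminated, so a NUL byte inside
    # it cannot be represented. No other byte value is excluded here.
    return len(raw) > 0 and b"\x00" not in raw


assert is_c_string_path(b"caf\xe9.txt")   # not UTF-8, still a C string
assert not is_c_string_path(b"a\x00b")    # embedded NUL

try:
    os.stat(b"a\x00b")
except ValueError:
    pass  # Python refuses the path before any system call is made
```

Note this checks whole paths only; an individual file name additionally cannot contain `/`.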