Comment by johncolanduoni

1 day ago

Well, Windows is an odd beast when 8-bit file names are used. If done naively, you can’t express all valid filenames with even broken UTF-8 and non-valid-Unicode filenames cannot be encoded to UTF-8 without loss or some weird convention.

You can do something like WTF-8 (not a misspelling, alas) to make it bidirectional. Rust does this under the hood but doesn’t expose the internal representation.

What do you mean by "when 8-bit filenames are used"? Do you mean the -A APIs, like CreateFileA()? Those do not take UTF-8, mind you -- unless you are using a relatively recent version of Windows that allows you to run your process with a UTF-8 codepage.

In general, Windows filenames are Unicode and you can always express those filenames by using the -W APIs (like CreateFileW()).

  • Windows filenames in the W APIs are 16-bit (which the A APIs essentially wrap with conversions to the active old-school codepage), and are normally well-formed UTF-16. But they aren’t required to be: NTFS itself only cares about 0x0000 and 0x005C (backslash) I believe, and all layers of the stack accept unpaired UTF-16 surrogates. Don’t get me started on the normal Win32 path processing (Unicode normalization, “COM” is still a special file, etc.), some of which can be bypassed with the “\\?\” prefix on NTFS.

    The upshot is that since the values aren’t always well-formed UTF-16, there’s no canonical way to convert them to 8-bit strings such that valid UTF-16 gets turned into valid UTF-8 but the rest can still be round-tripped. That’s what bastardized encodings like WTF-8 solve. The Rust Path API is the best take on this I’ve seen that doesn’t choke on bad Unicode.
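To make the round trip concrete, here is a rough Python sketch of the WTF-8 idea. It is not Rust's actual implementation, and the helper names `utf16_units_to_wtf8` and `wtf8_to_utf16_units` are made up for this example. It leans on Python's `surrogatepass` error handler: well-formed UTF-16 comes out as ordinary UTF-8, while an unpaired surrogate gets a 3-byte encoding that can be mapped back.

```python
def utf16_units_to_wtf8(units):
    """Encode a list of 16-bit code units (possibly ill-formed UTF-16) as bytes."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    # Valid surrogate pairs are merged into one code point here;
    # unpaired surrogates survive as lone code points in the str.
    text = raw.decode("utf-16-le", errors="surrogatepass")
    # Lone surrogates are written as 3-byte sequences instead of raising.
    return text.encode("utf-8", errors="surrogatepass")


def wtf8_to_utf16_units(data):
    """Reverse the conversion. Assumes the input came from the function above;
    this sketch does not validate arbitrary bytes against the WTF-8 rules."""
    text = data.decode("utf-8", errors="surrogatepass")
    raw = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


well_formed = [0x0041, 0xD83D, 0xDE00]   # "A" followed by U+1F600 as a surrogate pair
ill_formed = [0x0041, 0xD800, 0x0042]    # "A", an unpaired high surrogate, "B"

print(utf16_units_to_wtf8(well_formed))  # identical to plain UTF-8
print(utf16_units_to_wtf8(ill_formed))   # contains a 3-byte surrogate sequence
print(wtf8_to_utf16_units(utf16_units_to_wtf8(ill_formed)) == ill_formed)
```

The point is only that both directions are lossless; a strict UTF-8 decoder will still reject the ill-formed output, which is why this kind of encoding is usually kept internal.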

  • I think it depends on the underlying filesystem. Unicode (UTF-16) is first-class on NTFS. But Windows still supports FAT, I guess, where multiple 8-bit encodings are possible: the so-called "OEM" code pages (437, 850 etc.) or "ANSI" code pages (1250, 1251 etc.). I haven't checked how recent Windows versions cope with FAT file names that cannot be represented as Unicode.
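A small Python sketch of why the code page matters for 8-bit names (this only exercises Python's codec tables; it says nothing about how any particular Windows version actually reads FAT entries, and the helper name `decode_with` is invented for the example): the same byte maps to different characters depending on which code page is assumed.

```python
def decode_with(codepage, name_bytes):
    """Interpret raw 8-bit filename bytes under the given code page."""
    return name_bytes.decode(codepage)


raw = b"caf\xe9"

# The byte 0xE9 means different things under "ANSI" 1252 and "OEM" 437.
print(decode_with("cp1252", raw))  # 0xE9 is e-acute here
print(decode_with("cp437", raw))   # 0xE9 is a Greek capital theta here

# Going the other way, the same character is stored as different bytes.
print("é".encode("cp1252"), "é".encode("cp437"))
```

So an 8-bit name is only meaningful together with the code page it was written under, which is the ambiguity the 16-bit W APIs avoid.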

I believe the same is true on Linux, which only cares about 0x2f bytes (i.e. /) and NUL.

  • Windows paths are not necessarily well-formed UTF-16 (UCS-2 by some people’s definition) down to the filesystem level. If they were always well formed, you could convert to an 8-bit representation by straightforward Unicode re-encoding. But since they aren’t, you have to choose what to do with malformed UTF-16 if you want to round-trip them to 8-bit strings that match the UTF-8 encoding whenever the input is well formed.

    In Linux, they’re 8-bit almost-arbitrary strings like you noted, and usually UTF-8. So they always have a convenient 8-bit encoding (i.e. leave them alone). If you hated yourself and wanted to convert them to UTF-16, however, you’d have the same problem Windows does but in reverse.
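The reverse problem can be sketched in Python as well. This is an illustration with made-up helper names, using the `surrogateescape` error handler as one possible convention: a byte-string name that is not valid UTF-8 can be carried through a text type and back, but it has no strict UTF-16 form.

```python
# A Linux filename is just bytes; this one is Latin-1, so it is not valid UTF-8.
raw_name = b"caf\xe9.txt"


def to_text(name):
    """Map undecodable bytes to lone surrogates (U+DC80..U+DCFF)."""
    return name.decode("utf-8", errors="surrogateescape")


def to_bytes(text):
    """Turn those escape surrogates back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def can_encode_utf16(text):
    """True if the text has a strict (well-formed) UTF-16 encoding."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


text = to_text(raw_name)
print(to_bytes(text) == raw_name)    # the byte form round-trips
print(can_encode_utf16(text))        # the escaped name is not well-formed UTF-16
print(can_encode_utf16("café.txt"))  # an ordinary name is
```

Leaving the name as bytes needs no convention at all, which is the "leave them alone" option; the escape trick is only needed once something insists on a Unicode string.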